Abstract. We present an expository, general analysis of valid post-selection or post-regularization
inference about a low-dimensional target parameter in the presence of a very high-dimensional nuisance 
parameter that is estimated using selection or regularization methods. Our analysis provides a set of 
high-level conditions under which inference for the low-dimensional parameter based on testing or point 
estimation methods will be regular despite selection or regularization biases occurring in the estimation 
of the high-dimensional nuisance parameter. The results may be applied to establish uniform validity 
of post-selection or post-regularization inference procedures for low-dimensional target parameters over 
large classes of models. The high-level conditions allow one to clearly see the types of structure needed 
to achieve valid post-regularization inference and encompass many existing results. A key element 
of the structure we employ and discuss in detail is the use of so-called orthogonal or “immunized” 
estimating equations that are locally insensitive to small mistakes in estimation of the high-dimensional 
nuisance parameter. As an illustration, we use the high-level conditions to provide readily verifiable 
sufficient conditions for a class of affine-quadratic models that include the usual linear model and linear 
instrumental variables model as special cases. As a further application and illustration, we use these 
results to provide an analysis of post-selection inference in a linear instrumental variables model with 
many regressors and many instruments. We conclude with a review of other developments in post-selection
inference and note that many of the developments can be viewed as special cases of the general
encompassing framework of orthogonal estimating equations provided in this paper. 

Key words: Neyman, orthogonalization, $C(\alpha)$ statistics, optimal instrument, optimal score, optimal
moment, post-selection and post-regularization inference, efficiency, optimality 


1. Introduction 

Analysis of high-dimensional models, models in which the number of parameters to 
be estimated is large relative to the sample size, is becoming increasingly important. 
Such models arise naturally in readily available high-dimensional data which have many 
measured characteristics available per individual observation as in, for example, large 
survey data sets, scanner data, and text data. Such models also arise naturally even 
in data with a small number of measured characteristics in situations where the exact 
functional form with which the observed variables enter the model is unknown. Examples 
of this scenario include semiparametric models with nonparametric nuisance functions.
More generally, models with many parameters relative to the sample size often arise
when attempting to model complex phenomena. 

The key concept underlying the analysis of high-dimensional models is that regularization,
such as model selection or shrinkage of model parameters, is necessary if one
is to draw meaningful conclusions from the data. For example, the need for regularization
is obvious in a linear regression model with the number of right-hand-side variables
greater than the sample size, but arises far more generally in any setting in which the 
number of parameters is not small relative to the sample size. Given the importance of 
the use of regularization in analyzing high-dimensional models, it is then important to 
explicitly account for the impact of this regularization on the behavior of estimators if 
one wishes to accurately characterize their finite-sample behavior. The use of such regularization
techniques may easily invalidate conventional approaches to inference about
model parameters and other interesting target parameters. A major goal of this paper
is to present a general, formal framework that provides guidance about setting up estimating
equations and making appropriate use of regularization devices so that inference
about parameters of interest will remain valid in the presence of data-dependent model 
selection or other approaches to regularization. 

It is important to note that understanding estimators’ behavior in high-dimensional 
settings is also useful in conventional low-dimensional settings. As noted above, dealing 
formally with high-dimensional models requires that one explicitly accounts for model 
selection or other forms of regularization. Providing results that explicitly account for 
this regularization then allows us to accommodate and coherently account for the fact 
that low-dimensional models estimated in practice are often the result of specification 
searches. As in the high-dimensional setting, failure to account for this variable selection 
will invalidate the usual inference procedures, whereas the approach that we outline will 
remain valid and can easily be applied in conventional low-dimensional settings. 

The chief goal of this overview paper is to offer a general framework that encompasses 
many existing results regarding inference on model parameters in high-dimensional models.
The encompassing framework we present and the key theoretical results are new,
although they are clearly heavily influenced and foreshadowed by previous, more specialized
results. As an application of the framework, we also present new results on inference
in a reasonably broad class of models, termed affine-quadratic models, that includes the
usual linear model and linear instrumental variables (IV) model and then apply these
results to provide new ones regarding post-regularization inference on the parameters
on endogenous variables in a linear instrumental variables model with very many instruments
and controls (and also allowing for some misspecification). We also provide
a discussion of previous research that aims to highlight that many existing results fall 
within the general framework. 

Formally, we present a series of results for obtaining valid inferential statements about
a low-dimensional parameter of interest, $\alpha$, in the presence of a high-dimensional nuisance
parameter $\eta$. The general approach we offer relies on two fundamental elements.
First, it is important that estimating equations used to draw inferences about $\alpha$ satisfy
a key orthogonality or immunization condition.¹ For example, when estimation and
inference for $\alpha$ are based on the empirical analog of a theoretical system of equations

$$M(\alpha, \eta) = 0,$$

we show that setting up the equations in a manner such that the orthogonality or
immunization condition

$$\partial_{\eta} M(\alpha, \eta) = 0$$

holds is an important element in providing an inferential procedure for $\alpha$ that remains
valid when $\eta$ is estimated using regularization. We note that this condition can generally
be established. For example, we can apply Neyman’s classic orthogonalized score in
likelihood settings; see, e.g., Neyman (1959) and Neyman (1979). We also describe
an extension of this classic approach to the GMM setting. In general, applying this
orthogonalization will introduce additional nuisance parameters that will be treated as
part of $\eta$.

The second key element of our approach is the use of high-quality, structured estimators
of $\eta$. Crucially, additional structure on $\eta$ is needed for informative inference to
proceed, and it is thus important to use estimation strategies that leverage and perform
well under the desired structure. An example of a structure that has been usefully employed
in the recent literature is approximate sparsity, e.g. Belloni et al. (2012). Within
this framework, $\eta$ is well approximated by a sparse vector, which suggests the use of a
sparse estimator such as the Lasso (Frank and Friedman (1993) and Tibshirani (1996)).
The Lasso estimator solves the general problem

$$\hat\eta_L = \arg\min_{\eta} \ \ell(\text{data}, \eta) + \lambda \sum_{j=1}^{p} \psi_j |\eta_j|,$$

where $\ell(\text{data}, \eta)$ is some general loss function that depends on the data and the parameter
$\eta$, $\lambda$ is a penalty level, and the $\psi_j$’s are penalty loadings. The leading example is the usual
linear model in which $\ell(\text{data}, \eta) = \sum_{i=1}^{n} (y_i - x_i'\eta)^2$ is the usual least-squares loss, with
$y_i$ denoting the outcome of interest for observation $i$ and $x_i$ denoting predictor variables,
and we provide further discussion of this example in the appendix. Other examples
of $\ell(\text{data}, \eta)$ include suitable loss functions corresponding to well-known M-estimators,
the negative of the log-likelihood, and GMM criterion functions. This estimator and
related methods, such as those in Candès and Tao (2007), Meinshausen and Yu (2009),
Bickel et al. (2009), Belloni and Chernozhukov (2013), and Belloni et al. (2011), are
computationally efficient and have been shown to have good estimation properties even
when perfect variable selection is not feasible under approximate sparsity. These good
estimation properties then translate into providing “good enough” estimates of $\eta$ to
result in valid inference about $\alpha$ when coupled with orthogonal estimating equations as
discussed above. Finally, it is important to note that the general results we present do
not require or leverage approximate sparsity or sparsity-based estimation strategies. We
provide this discussion here simply as an example and because the structure offers one
concrete setting in which the general results we establish may be applied.

¹ We refer to the condition as an orthogonality or immunization condition as orthogonality is a much
used term and our usage differs from some other usage in defining orthogonality conditions used in
econometrics.
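As a concrete illustration of the penalized problem above, the following is a minimal sketch of a coordinate-descent solver for the least-squares Lasso with penalty loadings, written against the sum-of-squares loss. The function names, the simulated design, and the particular penalty level `lam` are assumptions made for this example only; they are not the tuning rules or the algorithm of any of the cited papers.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, psi, n_sweeps=1000, tol=1e-12):
    """Coordinate descent for sum_i (y_i - x_i'eta)^2 + lam * sum_j psi_j |eta_j|."""
    n, p = X.shape
    eta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)          # x_j'x_j for each column j
    resid = y - X @ eta                    # current residual y - X eta
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            # Inner product of column j with the partial residual (eta_j removed).
            rho = X[:, j] @ resid + col_ss[j] * eta[j]
            new = soft_threshold(rho, lam * psi[j] / 2.0) / col_ss[j]
            if new != eta[j]:
                resid -= X[:, j] * (new - eta[j])
                max_change = max(max_change, abs(new - eta[j]))
                eta[j] = new
        if max_change < tol:
            break
    return eta

# Simulated sparse linear model (illustrative values only).
rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
eta_true = np.zeros(p)
eta_true[:3] = [2.0, -1.5, 1.0]
y = X @ eta_true + rng.standard_normal(n)
psi = np.ones(p)                              # equal penalty loadings
lam = 2.0 * np.sqrt(2.0 * n * np.log(p))      # hypothetical penalty level
eta_hat = lasso_cd(X, y, lam, psi)
```

Each coordinate update is the closed-form minimizer of the one-dimensional penalized problem, which is why the threshold is `lam * psi[j] / 2` under the sum-of-squares scaling of the loss.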


In the remainder of this paper, we present the main results. In Sections 2 and 3, we
provide our general set of results that may be used to establish uniform validity of inference
about low-dimensional parameters of interest in the presence of high-dimensional
nuisance parameters. We provide the framework in Section 2, and then discuss how to
achieve the key orthogonality condition in Section 3. In Sections 4 and 5, we provide
details about establishing the necessary results for the estimation quality of $\hat\eta$ within
the approximately sparse framework. The analysis in Section 4 pertains to a reasonably
general class of affine-quadratic models, and the analysis of Section 5 specializes this
result to the case of estimating the parameters on a vector of endogenous variables in a
linear instrumental variables model with very many potential control variables and very
many potential instruments. The analysis in Section 5 thus extends results from Belloni
et al. (2012) and Belloni, Chernozhukov and Hansen (2014). We also provide a brief
simulation example and an empirical example that looks at logit demand estimation
within the linear many instrument and many control setting in Section 5. We conclude
with a literature review in Section 6.


Notation. We use “wp $\to 1$” to abbreviate the phrase “with probability that converges
to 1”, and we use the arrows $\to_{P_n}$ and $\rightsquigarrow_{P_n}$ to denote convergence in probability
and in distribution under the sequence of probability measures $\{P_n\}$. The symbol $\sim$
means “distributed as”. The notation $a \lesssim b$ means that $a = O(b)$ and $a \lesssim_{P_n} b$ means
that $a = O_{P_n}(b)$. The $\ell_2$ and $\ell_1$ norms are denoted by $\|\cdot\|$ and $\|\cdot\|_1$, respectively; and the
$\ell_0$-“norm”, $\|\cdot\|_0$, denotes the number of non-zero components of a vector. When applied
to a matrix, $\|\cdot\|$ denotes the operator norm. We use the notation $a \vee b = \max(a, b)$ and
$a \wedge b = \min(a, b)$. Here and below, $\mathbb{E}_n[\cdot]$ abbreviates the average $n^{-1}\sum_{i=1}^{n}[\cdot]$ over index
$i$. That is, $\mathbb{E}_n[f(w_i)]$ denotes $n^{-1}\sum_{i=1}^{n} f(w_i)$. In what follows, we use the $m$-sparse
norm of a matrix $Q$ defined as

$$\|Q\|_{sp(m)} = \sup\{ |b'Qb| / \|b\|^2 : \|b\|_0 \le m, \ \|b\| \neq 0 \}.$$

We also consider the pointwise norm of a square matrix $Q$ at a point $x \neq 0$:

$$\|Q\|_{pw(x)} = |x'Qx| / \|x\|^2.$$

For a differentiable map $x \mapsto f(x)$, mapping $\mathbb{R}^d$ to $\mathbb{R}^k$, we use $\partial_{x'} f$ to abbreviate the
partial derivatives $(\partial/\partial x') f$, and we correspondingly use the expression $\partial_{x'} f(x_0)$ to mean
$\partial_{x'} f(x) |_{x = x_0}$, etc. We use $x'$ to denote the transpose of a column vector $x$.
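To make the two matrix norms just defined concrete, here is a small brute-force sketch. It enumerates all supports of size m, so it is only meant for tiny matrices; the function names `sparse_norm` and `pointwise_norm` and the example matrix are hypothetical and introduced purely for illustration.

```python
import itertools
import numpy as np

def sparse_norm(Q, m):
    """Brute-force m-sparse norm: sup |b'Qb| / ||b||^2 over nonzero b with at most m non-zeros."""
    Q = np.asarray(Q, dtype=float)
    S = (Q + Q.T) / 2.0                    # b'Qb depends only on the symmetric part of Q
    p = S.shape[0]
    best = 0.0
    # Supports of size exactly min(m, p) suffice, since b may have zeros inside its support.
    for support in itertools.combinations(range(p), min(m, p)):
        idx = list(support)
        eigs = np.linalg.eigvalsh(S[np.ix_(idx, idx)])
        best = max(best, float(np.max(np.abs(eigs))))
    return best

def pointwise_norm(Q, x):
    """Pointwise norm |x'Qx| / ||x||^2 at a nonzero point x."""
    x = np.asarray(x, dtype=float)
    return abs(x @ np.asarray(Q, dtype=float) @ x) / (x @ x)

Q = np.array([[2.0, 1.0, 0.0],
              [1.0, -3.0, 0.5],
              [0.0, 0.5, 1.0]])
norm_1 = sparse_norm(Q, 1)   # largest absolute diagonal entry
norm_3 = sparse_norm(Q, 3)   # largest absolute eigenvalue of the symmetric part
```

For a fixed support the supremum is the largest absolute eigenvalue of the corresponding principal submatrix, which is what the loop computes.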


2. A Testing and Estimation Approach to Valid Post-Selection and Post-Regularization Inference

2.1. The Setting. We assume that estimation is based on the first $n$ elements $(w_{i,n})_{i=1}^{n}$
of the stationary data-stream $(w_{i,n})_{i=1}^{\infty}$, which lives on the probability space $(\Omega, \mathscr{A}, P_n)$.

The data points $w_{i,n}$ take values in a measurable space $\mathcal{W}$ for each $i$ and $n$. Here, $P_n$,
the probability law or data-generating process, can change with $n$. We allow the law to
change with $n$ to claim robustness or uniform validity of results with respect to perturbations
of such laws. Thus the data, all parameters, estimators, and other quantities are
indexed by $n$, but we typically suppress this dependence to simplify notation.

The target parameter value $\alpha = \alpha_0$ is assumed to solve the system of theoretical
equations

$$M(\alpha, \eta_0) = 0,$$

where $M = (M_l)_{l=1}^{k}$ is a measurable map from $\mathcal{A} \times \mathcal{H}$ to $\mathbb{R}^k$ and $\mathcal{A} \times \mathcal{H}$ are some
convex subsets of $\mathbb{R}^d \times \mathbb{R}^p$. Here the dimension $d$ of the target parameter $\alpha \in \mathcal{A}$ and
the number of equations $k$ are assumed to be fixed and the dimension $p = p_n$ of the
nuisance parameter $\eta \in \mathcal{H}$ is allowed to be very high, potentially much larger than $n$. To
handle the high-dimensional nuisance parameter $\eta$, we employ structured assumptions
and selection or regularization methods appropriate for the structure to estimate $\eta_0$.

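Given an appropriate estimator $\hat\eta$, we can construct an estimator $\hat\alpha$ as an approximate
solution to the estimating equation:

$$\|\hat M(\hat\alpha, \hat\eta)\| \le \inf_{\alpha \in \mathcal{A}} \|\hat M(\alpha, \hat\eta)\| + o(n^{-1/2}),$$

where $\hat M = (\hat M_l)_{l=1}^{k}$ is the empirical analog of theoretical equations $M$, which is a measurable
map from $\mathcal{W}^n \times \mathcal{A} \times \mathcal{H}$ to $\mathbb{R}^k$. We can also use $\hat M(\alpha, \hat\eta)$ to test hypotheses about
$\alpha_0$ and then invert the tests to construct confidence sets.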

It is not required in the formulation above, but a typical case is when $M$ and $\hat M$ are
formed as theoretical and empirical moment functions:

$$M(\alpha, \eta) := E[\psi(w_i, \alpha, \eta)], \qquad \hat M(\alpha, \eta) := \mathbb{E}_n[\psi(w_i, \alpha, \eta)],$$

where $\psi = (\psi_l)_{l=1}^{k}$ is a measurable map from $\mathcal{W} \times \mathcal{A} \times \mathcal{H}$ to $\mathbb{R}^k$. Of course, there are
many problems that do not fall in the moment condition framework.


2.2. Valid Inference via Testing. A simple introduction to the inferential problem is
via the testing problem in which we would like to test some hypothesis about the true
parameter value $\alpha_0$. By inverting the test, we create a confidence set for $\alpha_0$. The key
condition for the validity of this confidence region is adaptivity, which can be ensured
by using orthogonal estimating equations and using structured assumptions on the high-dimensional
nuisance parameter.²

The key condition enabling us to perform valid inference on $\alpha_0$ is the adaptivity
condition:

$$\sqrt{n}\,(\hat M(\alpha_0, \hat\eta) - \hat M(\alpha_0, \eta_0)) \to_{P_n} 0. \tag{1}$$

This condition states that using $\sqrt{n}\hat M(\alpha_0, \hat\eta)$ is as good as using $\sqrt{n}\hat M(\alpha_0, \eta_0)$, at least to
the first order. This condition may hold despite using estimators $\hat\eta$ that are not asymptotically
linear and are non-regular. Verification of adaptivity may involve substantial
work as illustrated below. A key requirement that often arises is the orthogonality or
immunization condition:

$$\partial_{\eta'} M(\alpha_0, \eta_0) = 0. \tag{2}$$

This condition states that the equations are locally insensitive to small perturbations of
the nuisance parameter around the true parameter values. In several important models,
this condition is equivalent to the double-robustness condition (Robins and Rotnitzky
(1995)). Additional assumptions regarding the quality of estimation of $\eta_0$ are also needed
and are highlighted below.

² We refer to Bickel (1982) for a definition of and introduction to adaptivity.
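The local insensitivity expressed by the orthogonality condition can be seen numerically in a simple example. The sketch below assumes, purely for illustration, a partially linear model $y = d\alpha_0 + x'\beta_0 + \varepsilon$ with $d = x'\pi_0 + v$, and compares a naive moment $(y - d\alpha - x'\beta)d$ with an orthogonalized moment $(y - d\alpha - x'\beta)(d - x'\pi)$. The model, parameter values, and function names are assumptions of this example and are not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20000, 5
alpha0 = 1.0
beta0 = np.array([1.0, -0.5, 0.25, 0.0, 0.0])   # hypothetical nuisance coefficients
pi0 = np.array([1.0, 0.5, 0.0, 0.0, -0.5])      # hypothetical first-stage coefficients

x = rng.standard_normal((n, p))
v = rng.standard_normal(n)
d = x @ pi0 + v
y = alpha0 * d + x @ beta0 + rng.standard_normal(n)

def m_naive(alpha, beta):
    """Empirical moment E_n[(y - d*alpha - x'beta) * d]; not orthogonal to beta."""
    return np.mean((y - alpha * d - x @ beta) * d)

def m_orth(alpha, beta, pi):
    """Empirical moment E_n[(y - d*alpha - x'beta) * (d - x'pi)]; orthogonal to (beta, pi)."""
    return np.mean((y - alpha * d - x @ beta) * (d - x @ pi))

# Perturb the nuisance parameters away from their true values.
delta = 0.1 * np.ones(p)
shift_naive = m_naive(alpha0, beta0 + delta) - m_naive(alpha0, beta0)
shift_orth_beta = m_orth(alpha0, beta0 + delta, pi0) - m_orth(alpha0, beta0, pi0)
shift_orth_pi = m_orth(alpha0, beta0, pi0 + delta) - m_orth(alpha0, beta0, pi0)
```

In this design the naive moment moves by roughly $-\delta'\pi_0$ when $\beta$ is perturbed, whereas the orthogonalized moment moves only by sampling noise, because its population derivative with respect to both nuisance blocks is zero at the truth.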

The adaptivity condition immediately allows us to use the statistic $\sqrt{n}\hat M(\alpha_0, \hat\eta)$ to
perform inference. Indeed, suppose we have that

$$\Omega^{-1/2}(\alpha_0)\sqrt{n}\hat M(\alpha_0, \eta_0) \rightsquigarrow_{P_n} N(0, I_k) \tag{3}$$

for some positive definite $\Omega(\alpha) = \mathrm{Var}(\sqrt{n}\hat M(\alpha, \eta_0))$. This condition can be verified
using central limit theorems for triangular arrays. Such theorems are available for independently
and identically distributed (i.i.d.) as well as dependent and clustered data.
Suppose further that there exists $\hat\Omega(\alpha)$ such that

$$\hat\Omega^{-1/2}(\alpha_0)\,\Omega^{1/2}(\alpha_0) \to_{P_n} I_k. \tag{4}$$

It is then immediate that the following score statistic, evaluated at $\alpha = \alpha_0$, is asymptotically
normal,

$$S(\alpha) := \hat\Omega^{-1/2}(\alpha)\sqrt{n}\hat M(\alpha, \hat\eta) \rightsquigarrow_{P_n} N(0, I_k), \tag{5}$$

and that the quadratic form of this score statistic is asymptotically $\chi^2$ with $k$ degrees of
freedom:

$$C(\alpha_0) = \|S(\alpha_0)\|^2 \rightsquigarrow_{P_n} \chi^2(k). \tag{6}$$

The statistic given in (6) simply corresponds to a quadratic form in appropriately
normalized statistics that have the desired immunization or orthogonality condition. We
refer to this statistic as a “generalized $C(\alpha)$-statistic” in honor of Neyman’s fundamental
contributions, e.g. Neyman (1959) and Neyman (1979), because, in likelihood settings,
the statistic (6) reduces to Neyman’s $C(\alpha)$-statistic and the generalized score $S(\alpha_0)$ given
in (5) reduces to Neyman’s orthogonalized score. We demonstrate these relationships
in the special case of likelihood models in Section 3.1 and provide a generalization to
GMM models in Section 3.2. Both of these examples serve to illustrate the construction
of appropriate statistics in different settings, but we note that the framework applies far
more generally.

The following elementary result is an immediate consequence of the preceding discussion.


Proposition 1 (Valid Inference After Selection or Regularization). Consider a sequence
$\{\mathcal{P}_n\}$ of sets of probability laws such that for each sequence $\{P_n\} \in \{\mathcal{P}_n\}$ the adaptivity
condition (1), the normality condition (3), and the variance consistency condition (4)
hold. Then $CR_{1-a} = \{\alpha \in \mathcal{A} : C(\alpha) \le c(1-a)\}$, where $c(1-a)$ is the $(1-a)$-quantile of
a $\chi^2(k)$, is a uniformly valid confidence interval for $\alpha_0$ in the sense that

$$\lim_{n \to \infty} \sup_{P \in \mathcal{P}_n} |P(\alpha_0 \in CR_{1-a}) - (1-a)| = 0.$$

We remark here that in order to make the uniformity claim interesting we should insist
that the sets of probability laws $\mathcal{P}_n$ are non-decreasing in $n$, i.e. $\mathcal{P}_{\bar n} \subseteq \mathcal{P}_n$ whenever
$\bar n \le n$.

Proof. For any sequence of positive constants $\epsilon_n$ approaching 0, let $P_n \in \mathcal{P}_n$ be any
sequence such that

$$|P_n(\alpha_0 \in CR_{1-a}) - (1-a)| + \epsilon_n \ge \sup_{P \in \mathcal{P}_n} |P(\alpha_0 \in CR_{1-a}) - (1-a)|.$$

By conditions (3) and (4) we have that

$$P_n(\alpha_0 \in CR_{1-a}) = P_n(C(\alpha_0) \le c(1-a)) \to P(\chi^2(k) \le c(1-a)) = 1 - a,$$

which implies the conclusion from the preceding display. ■
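The test-inversion construction of the confidence region can be sketched in a few lines. The example below assumes a scalar target ($d = k = 1$) in a partially linear model and uses ordinary least squares residualization as a stand-in for a regularized nuisance estimator, which is reasonable here only because the simulated nuisance dimension is small. The simulated design, the grid, and the names `c_stat` and `CHI2_1_95` are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 5
alpha0 = 0.5
x = rng.standard_normal((n, p))
d = x @ np.array([1.0, 0.5, 0.0, 0.0, -0.5]) + rng.standard_normal(n)
y = alpha0 * d + x @ np.array([1.0, -0.5, 0.25, 0.0, 0.0]) + rng.standard_normal(n)

# Nuisance step: OLS residualization stands in for a regularized estimator.
y_res = y - x @ np.linalg.lstsq(x, y, rcond=None)[0]
d_res = d - x @ np.linalg.lstsq(x, d, rcond=None)[0]

def c_stat(alpha):
    """Quadratic-form statistic n * M_hat(alpha)^2 / Omega_hat(alpha) for a scalar score."""
    psi = (y_res - alpha * d_res) * d_res      # orthogonal score evaluated at alpha
    m_hat = psi.mean()
    omega_hat = np.mean(psi ** 2)              # variance estimate assuming i.i.d. data
    return n * m_hat ** 2 / omega_hat

CHI2_1_95 = 3.841458820694124                  # 0.95-quantile of chi-square with 1 d.o.f.

# Invert the test on a grid: keep every alpha that is not rejected.
grid = np.linspace(alpha0 - 1.0, alpha0 + 1.0, 2001)
accepted = grid[np.array([c_stat(a) for a in grid]) <= CHI2_1_95]
conf_region = (accepted.min(), accepted.max())

# Point estimate that sets the empirical score to zero.
alpha_hat = (y_res @ d_res) / (d_res @ d_res)
```

Because the score is exactly identified here, the statistic is zero at `alpha_hat`, so the region always contains the point estimate; whether it covers the true value in a given sample is of course random.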


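2.3. Valid Inference via Adaptive Estimation. Suppose that $M(\alpha_0, \eta_0) = 0$ holds
for $\alpha_0 \in \mathcal{A}$. We consider an estimator $\hat\alpha \in \mathcal{A}$ that is an approximate minimizer of the
map $\alpha \mapsto \|\hat M(\alpha, \hat\eta)\|$ in the sense that

$$\|\hat M(\hat\alpha, \hat\eta)\| \le \inf_{\alpha \in \mathcal{A}} \|\hat M(\alpha, \hat\eta)\| + o(n^{-1/2}). \tag{7}$$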


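In order to analyze this estimator, we assume that the derivatives $\Gamma_1 := \partial_{\alpha'} M(\alpha_0, \eta_0)$
and $\partial_{\eta'} M(\alpha, \eta_0)$ exist. We assume that $\alpha_0$ is interior relative to the parameter space $\mathcal{A}$;
namely, for some $\ell_n \to \infty$ such that $\ell_n / \sqrt{n} \to 0$,

$$\{\alpha \in \mathbb{R}^d : \|\alpha - \alpha_0\| \le \ell_n / \sqrt{n}\} \subset \mathcal{A}. \tag{8}$$

We also assume that the following local-global identifiability condition holds: For some
constant $c > 0$,

$$2\|M(\alpha, \eta_0)\| \ge \|\Gamma_1(\alpha - \alpha_0)\| \wedge c \quad \forall \alpha \in \mathcal{A}, \qquad \mathrm{mineig}(\Gamma_1'\Gamma_1) \ge c. \tag{9}$$

Furthermore, for $\Omega = \mathrm{Var}(\sqrt{n}\hat M(\alpha_0, \eta_0))$, we suppose that the central limit theorem,

$$\Omega^{-1/2}\sqrt{n}\hat M(\alpha_0, \eta_0) \rightsquigarrow_{P_n} N(0, I), \tag{10}$$

and the stability condition,

$$\|\Gamma_1'\Gamma_1\| + \|\Omega\| + \|\Omega^{-1}\| \lesssim 1, \tag{11}$$

hold.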


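Assume that for some sequence of positive numbers $\{r_n\}$ such that $r_n \to 0$ and
$r_n n^{1/2} \to \infty$, the following stochastic equicontinuity and continuity conditions hold:

$$\sup_{\alpha \in \mathcal{A}} \frac{\|\hat M(\alpha, \hat\eta) - M(\alpha, \hat\eta)\| + \|M(\alpha, \hat\eta) - M(\alpha, \eta_0)\|}{r_n + \|\hat M(\alpha, \hat\eta)\| + \|M(\alpha, \eta_0)\|} \to_{P_n} 0, \tag{12}$$

$$\sup_{\|\alpha - \alpha_0\| \le r_n} \frac{\|\hat M(\alpha, \hat\eta) - M(\alpha, \hat\eta) - \hat M(\alpha_0, \eta_0)\|}{n^{-1/2} + \|\hat M(\alpha, \hat\eta)\| + \|M(\alpha, \eta_0)\|} \to_{P_n} 0. \tag{13}$$

Suppose that uniformly for all $\alpha \neq \alpha_0$ such that $\|\alpha - \alpha_0\| \le r_n \to 0$, the following
conditions on the smoothness of $M$ and the quality of the estimator $\hat\eta$ hold, as $n \to \infty$:

$$\begin{aligned}
&\|M(\alpha, \eta_0) - M(\alpha_0, \eta_0) - \Gamma_1[\alpha - \alpha_0]\| \, \|\alpha - \alpha_0\|^{-1} \to 0, \\
&\sqrt{n}\,\|M(\alpha, \hat\eta) - M(\alpha, \eta_0) - \partial_{\eta'} M(\alpha, \eta_0)[\hat\eta - \eta_0]\| \to_{P_n} 0, \\
&\|\{\partial_{\eta'} M(\alpha, \eta_0) - \partial_{\eta'} M(\alpha_0, \eta_0)\}[\hat\eta - \eta_0]\| \, \|\alpha - \alpha_0\|^{-1} \to_{P_n} 0.
\end{aligned} \tag{14}$$

Finally, as before, we assume that the orthogonality condition

$$\partial_{\eta'} M(\alpha_0, \eta_0) = 0 \tag{15}$$

holds.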

The above conditions extend the analysis of Pakes and Pollard (1989) and Chen et al.
(2003), which in turn extended Huber’s (1964) classical results on Z-estimators. These
conditions allow for both smooth and non-smooth systems of estimating equations. The
identifiability condition imposed above is mild and holds for broad classes of identifiable
models. The equicontinuity and smoothness conditions imposed above require mild
smoothness on the function $M$ and also require that $\hat\eta$ is a good-quality estimator of
$\eta_0$. In particular, these conditions will often require that $\hat\eta$ converges to $\eta_0$ at a faster
rate than $n^{-1/4}$ as demonstrated, for example, in the next section. However, the rate
condition alone is not sufficient for adaptivity. We also need the orthogonality condition
(15). In addition, it is required that $\hat\eta \in \mathcal{H}_n$, where $\mathcal{H}_n$ is a set whose complexity does
not grow too quickly with the sample size, to verify the stochastic equicontinuity condition;
see, e.g., Belloni, Chernozhukov, Fernández-Val and Hansen (2013) and Belloni,
Chernozhukov and Kato (2013b). In the next section, we use the sparsity of $\hat\eta$ to control
this complexity. Note that conditions (12)-(13) can be simplified by leaving only $r_n$
and $n^{-1/2}$ in the denominator, though this simplification would then require imposing
compactness on $\mathcal{A}$ even in linear problems.


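Proposition 2 (Valid Inference via Adaptive Estimation after Selection or Regularization).
Consider a sequence $\{\mathcal{P}_n\}$ of sets of probability laws such that for each sequence
$\{P_n\} \in \{\mathcal{P}_n\}$ conditions (7)-(15) hold. Then

$$\sqrt{n}(\hat\alpha - \alpha_0) + [\Gamma_1'\Gamma_1]^{-1}\Gamma_1'\sqrt{n}\hat M(\alpha_0, \eta_0) \to_{P_n} 0.$$

In addition, for $V_n := (\Gamma_1'\Gamma_1)^{-1}\Gamma_1'\Omega\Gamma_1(\Gamma_1'\Gamma_1)^{-1}$, we have that

$$\lim_{n \to \infty} \sup_{P \in \mathcal{P}_n} \sup_{R \in \mathcal{R}} |P(V_n^{-1/2}\sqrt{n}(\hat\alpha - \alpha_0) \in R) - P(N(0, I) \in R)| = 0,$$

where $\mathcal{R}$ is a collection of all convex sets. Moreover, the result continues to apply if $V_n$ is
replaced by a consistent estimator $\hat V_n$ such that $\hat V_n - V_n \to_{P_n} 0$ under each sequence $\{P_n\}$.
Thus, $CR^{l}_{1-a} = [l'\hat\alpha \pm c(1 - a/2)(l'\hat V_n l / n)^{1/2}]$, where $c(1 - a/2)$ is the $(1 - a/2)$-quantile
of $N(0, 1)$, is a uniformly valid confidence set for $l'\alpha_0$:

$$\lim_{n \to \infty} \sup_{P \in \mathcal{P}_n} |P(l'\alpha_0 \in CR^{l}_{1-a}) - (1 - a)| = 0.$$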


Note that the above formulation implicitly accommodates weighting options. Suppose
$M^o$ and $\hat M^o$ are the original theoretical and empirical systems of equations, and let
$\Gamma_1^o = \partial_{\alpha'} M^o(\alpha_0, \eta_0)$ be the original Jacobian. We could consider $k \times k$ positive-definite
weight matrices $A$ and $\hat A$ such that

$$\|A^2\| + \|(A^2)^{-1}\| \lesssim 1, \qquad \|\hat A^2 - A^2\| \to_{P_n} 0. \tag{16}$$

For example, we may wish to use the optimal weighting matrix $A^2 = \mathrm{Var}(\sqrt{n}\hat M^o(\alpha_0, \eta_0))^{-1}$,
which can be estimated by $\hat A^2$ obtained using a preliminary estimator $\hat\alpha^o$ resulting from
solving the problem with some non-optimal weighting matrix such as $I$. We can then
simply redefine the system of equations and the Jacobian according to

$$\hat M(\alpha, \eta) = \hat A \hat M^o(\alpha, \eta), \qquad M(\alpha, \eta) = A M^o(\alpha, \eta), \qquad \Gamma_1 = A\Gamma_1^o. \tag{17}$$


Proposition 3 (Adaptive Estimation via Weighted Equations). Consider a sequence
$\{\mathcal{P}_n\}$ of sets of probability laws such that for each sequence $\{P_n\} \in \{\mathcal{P}_n\}$ the conditions
of Proposition 2 hold for the original pair of systems of equations $(M^o, \hat M^o)$ and that
(16) holds. Then these conditions also hold for the new pair $(M, \hat M)$ in (17), so that all
the conclusions of Proposition 2 apply to the resulting approximate argmin estimator $\hat\alpha$.
In particular, if we use $A^2 = \mathrm{Var}(\sqrt{n}\hat M^o(\alpha_0, \eta_0))^{-1}$ and $\hat A^2 - A^2 \to_{P_n} 0$, then the large
sample variance $V_n$ simplifies to $V_n = (\Gamma_1'\Gamma_1)^{-1}$.
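The sandwich variance of Proposition 2 and its simplification under optimal weighting can be checked numerically. The sketch below uses randomly generated placeholders for the original Jacobian and the moment variance; the helper names `sandwich_variance` and `confidence_interval` and all numerical inputs are assumptions made for illustration.

```python
import numpy as np
from statistics import NormalDist

def sandwich_variance(gamma1, omega):
    """V = (G'G)^{-1} G' Omega G (G'G)^{-1} for a k x d Jacobian G."""
    gtg_inv = np.linalg.inv(gamma1.T @ gamma1)
    return gtg_inv @ gamma1.T @ omega @ gamma1 @ gtg_inv

def confidence_interval(alpha_hat, v_hat, ell, n, level=0.95):
    """Interval ell'alpha_hat +/- c(1 - a/2) * sqrt(ell' V ell / n) with a = 1 - level."""
    c = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)
    center = float(ell @ alpha_hat)
    half = c * float(np.sqrt(ell @ v_hat @ ell / n))
    return center - half, center + half

rng = np.random.default_rng(3)
k, d = 4, 2
gamma1_o = rng.standard_normal((k, d))      # placeholder for the original Jacobian
B = rng.standard_normal((k, k))
omega_o = B @ B.T + k * np.eye(k)           # placeholder positive-definite moment variance

# Optimal weighting: A^2 = omega_o^{-1}, with A the symmetric inverse square root.
evals, evecs = np.linalg.eigh(omega_o)
A = evecs @ np.diag(evals ** -0.5) @ evecs.T
gamma1 = A @ gamma1_o                       # reweighted Jacobian
omega = A @ omega_o @ A.T                   # variance of the reweighted moments

V_unweighted = sandwich_variance(gamma1_o, omega_o)
V_weighted = sandwich_variance(gamma1, omega)

# Interval for the first component, using hypothetical estimates and sample size.
interval = confidence_interval(np.array([1.0, 2.0]), V_weighted, np.array([1.0, 0.0]), n=400)
```

After reweighting, the moment variance is the identity, so the sandwich collapses to the inverse of the reweighted Jacobian cross-product, and the unweighted variance exceeds it in the positive semidefinite order.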


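2.4. Inference via Adaptive “One-Step” Estimation. We next consider a “one-step”
estimator. To define the estimator, we start with an initial estimator $\tilde\alpha$ that
satisfies, for $r_n = o(n^{-1/4})$,

$$P_n(\|\tilde\alpha - \alpha_0\| \le r_n) \to 1. \tag{18}$$

The one-step estimator $\check\alpha$ then solves a linearized version of (7):

$$\check\alpha = \tilde\alpha - [\hat\Gamma_1'\hat\Gamma_1]^{-1}\hat\Gamma_1'\hat M(\tilde\alpha, \hat\eta), \tag{19}$$

where $\hat\Gamma_1$ is an estimator of $\Gamma_1$ such that

$$P_n(\|\hat\Gamma_1 - \Gamma_1\| \le r_n) \to 1. \tag{20}$$

Since the one-step estimator is considerably more crude than the argmin estimator, we
need to impose additional smoothness conditions. Specifically, we suppose that uniformly
for all $\alpha \neq \alpha_0$ such that $\|\alpha - \alpha_0\| \le r_n \to 0$, the following strengthened conditions on
stochastic equicontinuity, smoothness of $M$ and the quality of the estimator $\hat\eta$ hold, as
$n \to \infty$:

$$\begin{aligned}
&n^{1/2}\|\hat M(\alpha, \hat\eta) - M(\alpha, \hat\eta) - \hat M(\alpha_0, \eta_0)\| \to_{P_n} 0, \\
&\|M(\alpha, \eta_0) - M(\alpha_0, \eta_0) - \Gamma_1[\alpha - \alpha_0]\| \, \|\alpha - \alpha_0\|^{-2} \lesssim 1, \\
&\sqrt{n}\,\|M(\alpha, \hat\eta) - M(\alpha, \eta_0) - \partial_{\eta'} M(\alpha, \eta_0)[\hat\eta - \eta_0]\| \to_{P_n} 0, \\
&\sqrt{n}\,\|\{\partial_{\eta'} M(\alpha, \eta_0) - \partial_{\eta'} M(\alpha_0, \eta_0)\}[\hat\eta - \eta_0]\| \to_{P_n} 0.
\end{aligned} \tag{21}$$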


Proposition 4 (Valid Inference via Adaptive One-Step Estimators). Consider a sequence
$\{\mathcal{P}_n\}$ of sets of probability laws such that for each sequence $\{P_n\} \in \{\mathcal{P}_n\}$ the
conditions of Proposition 2 as well as (18), (20), and (21) hold. Then the one-step
estimator $\check\alpha$ defined by (19) is first order equivalent to the argmin estimator $\hat\alpha$:

$$\sqrt{n}(\check\alpha - \hat\alpha) \to_{P_n} 0.$$

Consequently, all conclusions of Proposition 2 apply to $\check\alpha$ in place of $\hat\alpha$.

The one-step estimator requires stronger regularity conditions than the argmin estimator.
Moreover, there is finite-sample evidence (e.g. Belloni, Chernozhukov and Wei
(2013)) that in practical problems the argmin estimator often works much better, since
the one-step estimator typically suffers from higher-order biases. This problem could
be alleviated somewhat by iterating on the one-step estimator, treating the previous
iteration as the “crude” start $\tilde\alpha$ for the next iteration.
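The one-step update is a single Gauss-Newton-type correction from a crude start. The sketch below applies it to a hypothetical linear system of empirical moments, where the linearization is exact and one step lands on the argmin of the moment norm; in nonlinear systems the two estimators agree only to first order. The moment function, the starting value, and the name `one_step` are assumptions of this example.

```python
import numpy as np

def one_step(alpha_start, m_hat, gamma1_hat):
    """One-step update: alpha_start - (G'G)^{-1} G' M_hat(alpha_start)."""
    g = gamma1_hat
    return alpha_start - np.linalg.solve(g.T @ g, g.T @ m_hat(alpha_start))

rng = np.random.default_rng(4)
k, d = 3, 2
G = rng.standard_normal((k, d))
g0 = rng.standard_normal(k)

def m_hat(alpha):
    """Hypothetical linear empirical moments M_hat(alpha) = g0 - G alpha."""
    return g0 - G @ alpha

gamma1_hat = -G                                      # Jacobian of m_hat with respect to alpha'
alpha_start = np.array([5.0, -5.0])                  # deliberately crude initial value
alpha_check = one_step(alpha_start, m_hat, gamma1_hat)
alpha_argmin = np.linalg.lstsq(G, g0, rcond=None)[0] # minimizer of ||g0 - G alpha||
```

Iterating the update, as suggested above, amounts to calling `one_step` repeatedly with the previous output as the new start and a refreshed Jacobian estimate.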


3. Achieving Orthogonality Using Neyman’s Orthogonalization 


Here we describe orthogonalization ideas that go back at least to Neyman (1959); see
also Neyman (1979). Neyman’s idea was to project the score that identifies the parameter
of interest onto the ortho-complement of the tangent space for the nuisance parameter.
This projection underlies semi-parametric efficiency theory, which is concerned particularly
with the case in which $\eta$ is infinite-dimensional, cf. van der Vaart (1998). Here
we consider finite-dimensional $\eta$ of high dimension; for discussion of infinite-dimensional
$\eta$ in an approximately sparse setting, see Belloni, Chernozhukov, Fernández-Val and
Hansen (2013) and Belloni, Chernozhukov and Kato (2013b).


3.1. The Classical Likelihood Case. In likelihood settings, the construction of orthogonal
equations was proposed by Neyman (1959), who used them in construction of
his celebrated $C(\alpha)$-statistic. The $C(\alpha)$-statistic, or the orthogonal score statistic, was
first explicitly utilized for testing (and also for setting up estimation) in high-dimensional
sparse models in Belloni, Chernozhukov and Kato (2013b) and Belloni, Chernozhukov
and Kato (2013a), in the context of quantile regression, and Belloni, Chernozhukov and
Wei (2013) in the context of logistic regression and other generalized linear models. More
recent uses of $C(\alpha)$-statistics (or close variants) include those by Voorman et al. (2014),
Ning and Liu (2014), and Yang et al. (2014).


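Suppose that the (possibly conditional, possibly quasi) log-likelihood function associated
with observation $w_i$ is $\ell(w_i, \alpha, \beta)$, where $\alpha \in \mathcal{A} \subset \mathbb{R}^d$ is the target parameter and
$\beta \in \mathcal{B} \subset \mathbb{R}^{p_0}$ is the nuisance parameter. Under regularity conditions, the true parameter
values $\gamma_0 = (\alpha_0', \beta_0')'$ obey

$$E[\partial_\alpha \ell(w_i, \alpha_0, \beta_0)] = 0, \qquad E[\partial_\beta \ell(w_i, \alpha_0, \beta_0)] = 0. \tag{22}$$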


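Now consider the moment function

$$M(\alpha, \eta) = E[\psi(w_i, \alpha, \eta)], \qquad \psi(w_i, \alpha, \eta) = \partial_\alpha \ell(w_i, \alpha, \beta) - \mu\,\partial_\beta \ell(w_i, \alpha, \beta). \tag{23}$$

Here the nuisance parameter is

$$\eta = (\beta', \mathrm{vec}(\mu)')' \in \mathcal{B} \times \mathcal{D} \subset \mathbb{R}^p, \qquad p = p_0 + d p_0,$$

where $\mu$ is the $d \times p_0$ orthogonalization parameter matrix whose true value $\mu_0$ solves the
equation:

$$J_{\alpha\beta} - \mu J_{\beta\beta} = 0 \quad (\text{i.e., } \mu_0 = J_{\alpha\beta} J_{\beta\beta}^{-1}), \tag{24}$$

where, for $\gamma := (\alpha', \beta')'$ and $\gamma_0 := (\alpha_0', \beta_0')'$,

$$J := \partial_{\gamma'} E[\partial_\gamma \ell(w_i, \gamma)]\big|_{\gamma = \gamma_0} =: \begin{pmatrix} J_{\alpha\alpha} & J_{\alpha\beta} \\ J_{\alpha\beta}' & J_{\beta\beta} \end{pmatrix}.$$

Note that $\mu_0$ not only creates the necessary orthogonality but also creates

• the optimal score (in statistical language)

• or, equivalently, the optimal instrument/moment (in econometric language)³

for inference about $\alpha_0$.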


Provided $\mu_0$ is well-defined, we have by (22) that

$$M(\alpha_0, \eta_0) = 0.$$

Moreover, the function $M$ has the desired orthogonality property:

$$\partial_{\eta'} M(\alpha_0, \eta_0) = \Big[ J_{\alpha\beta} - \mu_0 J_{\beta\beta}, \ F E[\partial_\beta \ell(w_i, \alpha_0, \beta_0)] \Big] = 0, \tag{25}$$

where $F$ is a tensor operator, such that $Fx = \partial \mu x / \partial \mathrm{vec}(\mu)' \,|_{\mu = \mu_0}$ is a $d \times (d p_0)$ matrix
for any vector $x$ in $\mathbb{R}^{p_0}$. Note that the orthogonality property holds for Neyman’s
construction even if the likelihood is misspecified. That is, $\ell(w_i, \gamma_0)$ may be a quasi-likelihood,
and the data need not be i.i.d. and may, for example, exhibit complex
dependence over $i$.


An alternative way to define $\mu_0$ arises by considering that, under the correct specification
and sufficient regularity, the information matrix equality holds and yields

$$J = J^0 := E[\partial_\gamma \ell(w_i, \gamma)\,\partial_\gamma \ell(w_i, \gamma)']\big|_{\gamma = \gamma_0}.$$

Hence define $\mu_0^* = J^0_{\alpha\beta} (J^0_{\beta\beta})^{-1}$ as the population projection coefficient of the score for the
main parameter $\partial_\alpha \ell(w_i, \gamma_0)$ on the score for the nuisance parameter $\partial_\beta \ell(w_i, \gamma_0)$:

$$\partial_\alpha \ell(w_i, \gamma_0) = \mu_0^*\,\partial_\beta \ell(w_i, \gamma_0) + \varrho, \qquad E[\varrho\,\partial_\beta \ell(w_i, \gamma_0)'] = 0. \tag{26}$$

We can see this construction as the non-linear version of Frisch-Waugh’s “partialling out”
from the linear regression model. It is important to note that under misspecification the
information matrix equality generally does not hold, and this projection approach does
not provide valid orthogonalization.


Lemma 1 (Neyman’s orthogonalization for (quasi-) likelihood scores). Suppose that
for each $\gamma = (\alpha, \beta) \in \mathcal{A} \times \mathcal{B}$, the derivative $\partial_\gamma \ell(w_i, \gamma)$ exists and is continuous at $\gamma$
with probability one, and obeys the dominance condition $E \sup_{\gamma \in \mathcal{A} \times \mathcal{B}} \|\partial_\gamma \ell(w_i, \gamma)\|^2 < \infty$.
Suppose that condition (22) holds for some (quasi-) true value $(\alpha_0, \beta_0)$. Then, (i) if $J$
exists and is finite and $J_{\beta\beta}$ is invertible, then the orthogonality condition (25) holds; (ii)
if the information matrix equality holds, namely $J = J^0$, then the orthogonality condition
(25) holds for the projection parameter $\mu_0^*$ in place of the orthogonalization parameter
matrix $\mu_0$.

³ The connection between optimal instruments/moments and likelihood/score has been elucidated by
the fundamental work of Chamberlain (1987).


The claim follows immediately from the computations above. 


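With the formulations given above, Neyman’s $C(\alpha)$-statistic takes the form

$$C(\alpha) = \|S(\alpha)\|^2, \qquad S(\alpha) = \hat\Omega^{-1/2}(\alpha, \hat\eta)\sqrt{n}\hat M(\alpha, \hat\eta),$$

where $\hat M(\alpha, \hat\eta) = \mathbb{E}_n[\psi(w_i, \alpha, \hat\eta)]$ as before, $\Omega(\alpha, \eta_0) = \mathrm{Var}(\sqrt{n}\hat M(\alpha, \eta_0))$, and $\hat\Omega(\alpha, \hat\eta)$
and $\hat\eta$ are suitable estimators based on sparsity or other structured assumptions. The
estimator is then

$$\hat\alpha = \arg\inf_{\alpha \in \mathcal{A}} C(\alpha) = \arg\inf_{\alpha \in \mathcal{A}} \|\hat\Omega^{-1/2}(\alpha, \hat\eta)\sqrt{n}\hat M(\alpha, \hat\eta)\|,$$

provided that $\hat\Omega(\alpha, \hat\eta)$ is positive definite for each $\alpha \in \mathcal{A}$. If the conditions of Section 2
hold, we have that

$$V_n^{-1/2}\sqrt{n}(\hat\alpha - \alpha_0) \rightsquigarrow N(0, I), \tag{27}$$

where $V_n = \Gamma_1^{-1}\Omega(\alpha_0, \eta_0)\Gamma_1^{-1}$ and $\Gamma_1 = J_{\alpha\alpha} - \mu_0 J_{\alpha\beta}'$. Under the correct specification
and i.i.d. sampling, the variance matrix $V_n$ further reduces to the optimal variance

$$\Gamma_1^{-1} = (J_{\alpha\alpha} - J_{\alpha\beta} J_{\beta\beta}^{-1} J_{\alpha\beta}')^{-1}$$

of the first $d$ components of the maximum likelihood estimator in a Gaussian shift
experiment with observation $Z \sim N(h, J_0^{-1})$. Likewise, the result (27) also holds for the
one-step estimator $\check\alpha$ of Section 2 in place of $\hat\alpha$ as long as the conditions in Section 2
hold.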
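The algebra behind Neyman's orthogonalization can be verified on a small numerical example. The sketch below takes a randomly generated symmetric positive-definite matrix as a placeholder for $J$ (an assumption for illustration, as would hold for an information matrix under correct specification), forms $\mu_0 = J_{\alpha\beta}J_{\beta\beta}^{-1}$, and checks that $\Gamma_1$ is the Schur complement whose inverse is the leading block of $J^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(5)
d, p0 = 2, 4
B = rng.standard_normal((d + p0, d + p0))
J = B @ B.T + np.eye(d + p0)                 # placeholder symmetric positive-definite J

# Partition J conformably with gamma = (alpha', beta')'.
J_aa, J_ab, J_bb = J[:d, :d], J[:d, d:], J[d:, d:]

# Orthogonalization matrix mu0 = J_ab J_bb^{-1}, computed via a linear solve (J_bb is symmetric).
mu0 = np.linalg.solve(J_bb, J_ab.T).T

# Derivative of the orthogonalized score with respect to beta: J_ab - mu0 J_bb.
orthogonality_gap = J_ab - mu0 @ J_bb

# Jacobian with respect to alpha: Gamma_1 = J_aa - mu0 J_ab'.
gamma1 = J_aa - mu0 @ J_ab.T
optimal_variance = np.linalg.inv(gamma1)
```

The last quantity coincides with the upper-left $d \times d$ block of $J^{-1}$ by the partitioned-inverse formula, which is the sense in which the orthogonalized score attains the variance of the first $d$ components of the maximum likelihood estimator.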


Provided that sparsity or its generalizations are plausible assumptions to make regarding
$\eta_0$, the formulations above naturally lend themselves to sparse estimation. For
example, Belloni, Chernozhukov and Wei (2013) used penalized and post-penalized maximum
likelihood to estimate $\beta_0$, and used the information matrix equality to estimate
the orthogonalization parameter matrix $\mu_0^*$ by using Lasso or Post-Lasso estimation of
the projection equation (26). It is also possible to estimate $\mu_0$ directly by finding approximate
sparse solutions to the empirical analog of the system of equations $J_{\alpha\beta} - \mu J_{\beta\beta} = 0$
using $\ell_1$-penalized estimation, as, e.g., in van de Geer et al. (2014), or post-$\ell_1$-penalized
estimation.


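3.2. Achieving Orthogonality in GMM Problems. Here we consider $\gamma_0 = (\alpha_0', \beta_0')'$ that solve the system of equations:
$$\mathrm{E}[m(w_i,\alpha_0,\beta_0)] = 0,$$
where $m : \mathcal{W}\times\mathcal{A}\times\mathcal{B} \mapsto \mathbb{R}^k$, $\mathcal{A}\times\mathcal{B}$ is a convex subset of $\mathbb{R}^d\times\mathbb{R}^{p_0}$, and $k \ge d + p_0$ is the number of moments. The orthogonal moment equation is
$$M(\alpha,\eta) = \mathrm{E}[\psi(w_i,\alpha,\eta)], \qquad \psi(w_i,\alpha,\eta) = \mu\, m(w_i,\alpha,\beta). \tag{28}$$
The nuisance parameter is
$$\eta = (\beta', \mathrm{vec}(\mu)')' \in \mathcal{B}\times\mathcal{D} \subset \mathbb{R}^p, \qquad p = p_0 + dk,$$
where $\mu$ is the $d\times k$ orthogonalization parameter matrix. The "true value" of $\mu$ is
$$\mu_0 = G_\alpha'\Omega_m^{-1} - G_\alpha'\Omega_m^{-1}G_\beta(G_\beta'\Omega_m^{-1}G_\beta)^{-1}G_\beta'\Omega_m^{-1},$$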
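where, for $\gamma = (\alpha',\beta')'$ and $\gamma_0 = (\alpha_0',\beta_0')'$,
$$G_\gamma = \partial_{\gamma'}\mathrm{E}[m(w_i,\alpha,\beta)]\Big|_{\gamma=\gamma_0} = \Big[\partial_{\alpha'}\mathrm{E}[m(w_i,\alpha,\beta)],\ \partial_{\beta'}\mathrm{E}[m(w_i,\alpha,\beta)]\Big]\Big|_{\gamma=\gamma_0} =: [G_\alpha, G_\beta]$$
and
$$\Omega_m = \mathrm{Var}(\sqrt{n}\,\mathbb{E}_n[m(w_i,\alpha_0,\beta_0)]).$$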

As before, we can interpret $\mu_0$ as an operator creating orthogonality while building


• the optimal instrument/moment (in econometric language), 

• or, equivalently, the optimal score function (in statistical language) 


The resulting moment function has the required orthogonality property; namely, the first derivative with respect to the nuisance parameter when evaluated at the true parameter values is zero:
$$\partial_{\eta'}M(\alpha_0,\eta)\Big|_{\eta=\eta_0} = \big[\mu_0 G_\beta,\ F\,\mathrm{E}[m(w_i,\alpha_0,\beta_0)]\big] = 0, \tag{29}$$
where $F$ is a tensor operator, such that $Fx = \partial\mu x/\partial\mathrm{vec}(\mu)'$ is a $d\times(dk)$ matrix for any vector $x$ in $\mathbb{R}^k$.

Estimation and inference on $\alpha_0$ can be based on the empirical analog of (28):
$$\hat M(\alpha,\hat\eta) = \mathbb{E}_n[\psi(w_i,\alpha,\hat\eta)],$$
where $\hat\eta$ is a post-selection or other regularized estimator of $\eta_0$. Note that the previous framework of (quasi)-likelihood is incorporated as a special case with
$$m(w_i,\alpha,\beta) = [\partial_\alpha \ell(w_i,\alpha)',\ \partial_\beta \ell(w_i,\beta)']'.$$


With the formulations above, Neyman's $C(\alpha)$-statistic takes the form:
$$C(\alpha) = \|S(\alpha)\|_2^2, \qquad S(\alpha) = \hat\Omega^{-1/2}(\alpha,\hat\eta)\sqrt{n}\,\hat M(\alpha,\hat\eta),$$
where $\hat M(\alpha,\hat\eta) = \mathbb{E}_n[\psi(w_i,\alpha,\hat\eta)]$ as before, $\Omega(\alpha,\eta_0) = \mathrm{Var}(\sqrt{n}\hat M(\alpha,\eta_0))$, and $\hat\Omega(\alpha,\hat\eta)$ and $\hat\eta$ are suitable estimators based on structured assumptions. The estimator is then
$$\hat\alpha = \arg\inf_{\alpha\in\mathcal{A}} C(\alpha) = \arg\inf_{\alpha\in\mathcal{A}} \|\sqrt{n}\hat M(\alpha,\hat\eta)\|,$$
provided that $\hat\Omega(\alpha,\hat\eta)$ is positive definite for each $\alpha\in\mathcal{A}$. If the high-level conditions of Section 2 hold, we have that
$$C(\alpha_0) \rightsquigarrow \chi^2(d), \qquad V_n^{-1/2}\sqrt{n}(\hat\alpha-\alpha_0) \rightsquigarrow N(0,I), \tag{30}$$
where $V_n = (\Gamma_1')^{-1}\Omega(\alpha_0,\eta_0)(\Gamma_1)^{-1}$ coincides with the optimal variance for GMM; here $\Gamma_1 = \mu_0 G_\alpha$. Likewise, the same result (30) holds for the one-step estimator $\check\alpha$ of Section 2 in place of $\hat\alpha$ as long as the conditions in Section 2 hold. In particular, the variance $V_n$ corresponds to the variance of the first $d$ components of the maximum likelihood estimator in the normal shift experiment with the observation $Z \sim N(h, (G_\gamma'\Omega_m^{-1}G_\gamma)^{-1})$.


Footnote: Cf. previous footnote.

The above is a generic outline of the properties that are expected for inference using orthogonalized GMM equations under structured assumptions. The problem of inference in GMM under sparsity is a very delicate matter due to the complex form of the orthogonalization parameters. One approach to the problem is developed in Chernozhukov et al. (2014).


4. Achieving Adaptivity In Affine-Quadratic Models via Approximate Sparsity

Here we take orthogonality as given and explain how we can use approximate sparsity to achieve the adaptivity property.


4.1. The Affine-Quadratic Model. We analyze the case in which $\hat M$ and $M$ are affine in $\alpha$ and affine-quadratic in $\eta$. Specifically, we suppose that for all $\alpha$
$$\hat M(\alpha,\eta) = \hat\Gamma_1(\eta)\alpha + \hat\Gamma_2(\eta), \qquad M(\alpha,\eta) = \Gamma_1(\eta)\alpha + \Gamma_2(\eta),$$
where the orthogonality condition holds,
$$\partial_{\eta'}M(\alpha_0,\eta_0) = 0,$$
and $\eta\mapsto\hat\Gamma_j(\eta)$ and $\eta\mapsto\Gamma_j(\eta)$ are affine-quadratic in $\eta$ for $j = 1$ and $j = 2$. That is, we will have that all second-order derivatives of $\hat\Gamma_j(\eta)$ and $\Gamma_j(\eta)$ for $j = 1$ and $j = 2$ are constant over the convex parameter space $\mathcal{H}$ for $\eta$.

This setting is both useful, including most widely used linear models as a special case, 
and pedagogical, permitting simple illustration of the key issues that arise in treating 
the general problem. The derivations given below easily generalize to more complicated 
models, but we defer the details to the interested reader. 

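The estimator in this case is
$$\hat\alpha = \arg\min_{\alpha\in\mathbb{R}^d}\|\hat M(\alpha,\hat\eta)\|^2 = -[\hat\Gamma_1(\hat\eta)'\hat\Gamma_1(\hat\eta)]^{-1}\hat\Gamma_1(\hat\eta)'\hat\Gamma_2(\hat\eta), \tag{31}$$
provided the inverse is well-defined. It follows that
$$\sqrt{n}(\hat\alpha-\alpha_0) = -[\hat\Gamma_1(\hat\eta)'\hat\Gamma_1(\hat\eta)]^{-1}\hat\Gamma_1(\hat\eta)'\sqrt{n}\hat M(\alpha_0,\hat\eta). \tag{32}$$
This estimator is adaptive if, for $\Gamma_1 := \Gamma_1(\eta_0)$,
$$\sqrt{n}(\hat\alpha-\alpha_0) + [\Gamma_1'\Gamma_1]^{-1}\Gamma_1'\sqrt{n}\hat M(\alpha_0,\eta_0) \to_{P_n} 0,$$
which occurs under the conditions in (10) and (11) if
$$\sqrt{n}(\hat M(\alpha_0,\hat\eta) - \hat M(\alpha_0,\eta_0)) \to_{P_n} 0, \qquad \hat\Gamma_1(\hat\eta) - \Gamma_1(\eta_0) \to_{P_n} 0. \tag{33}$$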


Therefore, the problem of the adaptivity of the estimator is directly connected to the problem of the adaptivity of testing hypotheses about $\alpha_0$.

Lemma 2 (Adaptive Testing and Estimation in Affine-Quadratic Models). Consider a sequence $\{\mathcal{P}_n\}$ of sets of probability laws such that for each sequence $\{P_n\}\in\{\mathcal{P}_n\}$, the conditions stated in the first paragraph of Section 4.1, condition ( ), the asymptotic normality condition ( ), the stability condition (11), and condition ( ) hold. Then all the conditions of Propositions 1 and 2 hold. Moreover, the conclusions of Proposition 1 hold, and the conclusions of Proposition 2 hold for the estimator $\hat\alpha$ in (31).
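The closed form of the estimator and the identity behind its expansion can be illustrated with a small numerical sketch. It assumes a scalar $\alpha$ and $k = 3$ moments, so that $\hat\Gamma_1$ and $\hat\Gamma_2$ are vectors; the numbers and function names are hypothetical. The check is that $\hat\alpha - \alpha_0 = -[\hat\Gamma_1'\hat\Gamma_1]^{-1}\hat\Gamma_1'\hat M(\alpha_0,\hat\eta)$, which is the expansion of $\sqrt{n}(\hat\alpha-\alpha_0)$ without the $\sqrt{n}$ factor.

```python
def affine_estimator(gamma1_hat, gamma2_hat):
    """Scalar-alpha least-squares solution: minimize sum_j (g1_j * a + g2_j)^2
    over a, which gives a = -(g1'g1)^{-1} g1'g2."""
    g1g1 = sum(g * g for g in gamma1_hat)
    if g1g1 == 0.0:
        raise ValueError("Gamma_1'Gamma_1 is not invertible")
    g1g2 = sum(g1 * g2 for g1, g2 in zip(gamma1_hat, gamma2_hat))
    return -g1g2 / g1g1


def moment(alpha, gamma1_hat, gamma2_hat):
    """Affine moment vector M_hat(alpha) = Gamma_1 * alpha + Gamma_2."""
    return [g1 * alpha + g2 for g1, g2 in zip(gamma1_hat, gamma2_hat)]


# Hypothetical estimated coefficients (illustrative numbers only).
gamma1_hat = [1.0, -2.0, 0.5]
gamma2_hat = [-1.9, 4.1, -1.2]
alpha0 = 2.0

alpha_hat = affine_estimator(gamma1_hat, gamma2_hat)
m_at_alpha0 = moment(alpha0, gamma1_hat, gamma2_hat)
expansion = -sum(g * m for g, m in zip(gamma1_hat, m_at_alpha0)) / sum(
    g * g for g in gamma1_hat
)
```

In this sketch `alpha_hat - alpha0` and `expansion` coincide up to rounding, so the estimation error is driven entirely by the moment function evaluated at $\alpha_0$.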


4.2. Adaptivity for Testing via Approximate Sparsity. Assuming the orthogonality condition holds, we follow Belloni et al. (2012) in using approximate sparsity to achieve the adaptivity property for the testing problem in the affine-quadratic models.


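We can expand each element $\hat M_j$ of $\hat M$ as follows:
$$\sqrt{n}(\hat M_j(\alpha_0,\hat\eta) - \hat M_j(\alpha_0,\eta_0)) = T_{1,j} + T_{2,j} + T_{3,j}, \tag{34}$$
where
$$T_{1,j} := \sqrt{n}\,\partial_\eta M_j(\alpha_0,\eta_0)'(\hat\eta-\eta_0),$$
$$T_{2,j} := \sqrt{n}\,(\partial_\eta\hat M_j(\alpha_0,\eta_0) - \partial_\eta M_j(\alpha_0,\eta_0))'(\hat\eta-\eta_0),$$
$$T_{3,j} := \sqrt{n}\,2^{-1}(\hat\eta-\eta_0)'\partial_\eta\partial_{\eta'}\hat M_j(\alpha_0)(\hat\eta-\eta_0). \tag{35}$$
The term $T_{1,j}$ vanishes precisely because of orthogonality, i.e.
$$T_{1,j} = 0.$$
However, terms $T_{2,j}$ and $T_{3,j}$ need not vanish. In order to show that they are asymptotically negligible, we need to impose further structure on the problem.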

Structure 1: Exact Sparsity. We first consider the case of using an exact sparsity structure where $\|\eta_0\|_0 \le s$ and $s = s_n \ge 1$ can depend on $n$. We then use estimators $\hat\eta$ that exploit the sparsity structure.

Suppose that the following bounds hold with probability $1 - o(1)$ under $P_n$:
$$\|\hat\eta\|_0 \lesssim s, \quad \|\eta_0\|_0 \le s, \quad \|\hat\eta-\eta_0\|_2 \lesssim \sqrt{(s/n)\log(pn)}, \quad \|\hat\eta-\eta_0\|_1 \lesssim \sqrt{(s^2/n)\log(pn)}. \tag{36}$$


These conditions are typical performance bounds which hold for many sparsity-based 
estimators such as Lasso, post-Lasso, and their extensions. 


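We suppose further that the moderate deviation bound
$$\bar T_{2,j} = \|\sqrt{n}(\partial_\eta\hat M_j(\alpha_0,\eta_0) - \partial_\eta M_j(\alpha_0,\eta_0))\|_\infty \lesssim_{P_n} \sqrt{\log(pn)} \tag{37}$$
holds and that the sparse norm of the second-derivative matrix is bounded:
$$\bar T_{3,j} = \|\partial_\eta\partial_{\eta'}\hat M_j(\alpha_0)\|_{\mathrm{sp}(\ell_n s)} \lesssim_{P_n} 1, \tag{38}$$
where $\ell_n \to \infty$ but $\ell_n = o(\log n)$.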


Following Belloni et al. (2012), we can verify condition (37) using the moderate deviation theory for self-normalized sums (e.g., Jing et al. (2003)), which allows us to avoid making highly restrictive subgaussian or gaussian tail assumptions. Likewise, following Belloni et al. (2012), we can verify the second condition using laws of large numbers for large matrices acting on sparse vectors as in Rudelson and Vershynin (2008) and Rudelson and Zhou (2011); see Lemma 7. Indeed, condition (38) holds if


The above analysis immediately implies the following elementary result. 


Lemma 3 (Elementary Adaptivity for Testing via Sparsity). Let $\{P_n\}$ be a sequence of probability laws. Assume (i) $\eta\mapsto\hat M(\alpha_0,\eta)$ and $\eta\mapsto M(\alpha_0,\eta)$ are affine-quadratic in $\eta$ and the orthogonality condition holds, (ii) that the conditions on sparsity and the quality of estimation (36) hold, and the sparsity index obeys
$$s^2\log(pn)^2/n \to 0, \tag{39}$$
(iii) that the moderate deviation bound (37) holds, and (iv) the sparse norm of the second derivatives matrix is bounded as in (38). Then the adaptivity condition holds for the sequence $\{P_n\}$.

We note that (39) requires that the true value of the nuisance parameter is sufficiently sparse, which we can relax in some special cases to the requirement $s\log(pn)^c/n \to 0$, for some constant $c$, by using sample-splitting techniques; see Belloni et al. (2012). However, this requirement seems unavoidable in general.

Proof. We note above that $T_{1,j} = 0$ by orthogonality. Under (36)-(37), if $s^2\log(pn)^2/n \to 0$, then $T_{2,j}$ vanishes in probability, as by Hölder's inequality,
$$|T_{2,j}| \le \bar T_{2,j}\|\hat\eta-\eta_0\|_1 \lesssim_{P_n} \sqrt{s^2\log(pn)^2/n} \to 0.$$
Also, if $s^2\log(pn)^2/n \to 0$, then $T_{3,j}$ vanishes in probability, since by Hölder's inequality and for sufficiently large $n$,
$$|T_{3,j}| \le \sqrt{n}\,\bar T_{3,j}\|\hat\eta-\eta_0\|_2^2 \lesssim_{P_n} \sqrt{n}\,s\log(pn)/n \to 0.$$
The conclusion follows from (34). ■
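Both bounds in the proof are of the same order, $s\log(pn)/\sqrt{n}$, whose square is the index in the sparsity condition. The following sketch evaluates that order along a hypothetical growth path ($p = n$ and $s = n^{1/4}$, chosen only for illustration). It reads $\log(pn)$ as the logarithm of the product $p\cdot n$, which is an assumption about the notation, and the function name is hypothetical.

```python
import math


def sparsity_rate(n, p, s):
    """Return s * log(p * n) / sqrt(n), the common order of the bounds on
    T_{2,j} and T_{3,j}; its square is s^2 log(pn)^2 / n."""
    return s * math.log(p * n) / math.sqrt(n)


# Hypothetical growth path: p = n and s = n ** (1/4).
sample_sizes = (10**4, 10**6, 10**8)
rates = [sparsity_rate(n, n, n ** 0.25) for n in sample_sizes]
```

Along this path the rate decreases with $n$, but slowly, which is consistent with the remark that the sparsity requirement is demanding.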

Structure 2. Approximate Sparsity. Following Belloni et al. (2012), we next consider an approximate sparsity structure. Approximate sparsity imposes that, given a constant $c > 0$, we can decompose $\eta_0$ into a sparse component $\eta_0^m$ and a "small" non-sparse component $\eta_0^r$:
$$\eta_0 = \eta_0^m + \eta_0^r, \qquad \mathrm{support}(\eta_0^m)\cap\mathrm{support}(\eta_0^r) = \emptyset,$$
$$\|\eta_0^m\|_0 \le s, \qquad \|\eta_0^r\|_2 \le c\sqrt{s/n}, \qquad \|\eta_0^r\|_1 \le c\sqrt{s^2/n}. \tag{40}$$
This condition allows for much more realistic and richer models than can be accommodated under exact sparsity. For example, $\eta_0$ need not have any zero components at all under approximate sparsity. In Section 5, we provide an example in which (40) arises from a more primitive condition that the absolute values $\{|\eta_{0j}|, j = 1,\dots,p\}$, sorted in decreasing order, decay at a polynomial speed with respect to $j$.

Suppose that we have an estimator $\hat\eta$ such that with probability $1 - o(1)$ under $P_n$ the following bounds hold:
$$\|\hat\eta\|_0 \lesssim s, \qquad \|\hat\eta-\eta_0^m\|_2 \lesssim \sqrt{(s/n)\log(pn)}, \qquad \|\hat\eta-\eta_0^m\|_1 \lesssim \sqrt{(s^2/n)\log(pn)}. \tag{41}$$
This condition is again a standard performance bound expected to hold for sparsity-based estimators under approximate sparsity conditions; see Belloni et al. (2012). Note that by the approximate sparsity condition, we also have that, with probability $1 - o(1)$ under $P_n$,
$$\|\hat\eta-\eta_0\|_2 \lesssim \sqrt{(s/n)\log(pn)}, \qquad \|\hat\eta-\eta_0\|_1 \lesssim \sqrt{(s^2/n)\log(pn)}. \tag{42}$$


We can employ the same moderate deviation and bounded sparse norm conditions as in the previous subsection. In addition, we require the pointwise norm of the second-derivatives matrix to be bounded. Specifically, for any deterministic vector $a \ne 0$, we require
$$\|\partial_\eta\partial_{\eta'}\hat M_j(\alpha_0)\|_{\mathrm{pw}(a)} \lesssim_{P_n} 1. \tag{43}$$
This condition can be easily verified using ordinary laws of large numbers.


Lemma 4 (Elementary Adaptivity for Testing via Approximate Sparsity). Let $\{P_n\}$ be a sequence of probability laws. Assume (i) $\eta\mapsto\hat M(\alpha_0,\eta)$ and $\eta\mapsto M(\alpha_0,\eta)$ are affine-quadratic in $\eta$ and the orthogonality condition holds, (ii) that the conditions on approximate sparsity (40) and the quality of estimation (41) hold, and the sparsity index obeys
$$s^2\log(pn)^2/n \to 0,$$
(iii) that the moderate deviation bound (37) holds, (iv) the sparse norm of the second derivatives matrix is bounded as in (38), and (v) the pointwise norm of the second derivative matrix is bounded as in (43). Then the adaptivity condition holds:
$$\sqrt{n}(\hat M(\alpha_0,\hat\eta) - \hat M(\alpha_0,\eta_0)) \to_{P_n} 0.$$


4.3. Adaptivity for Estimation via Approximate Sparsity. We work with the 
approximate sparsity setup and the affine-quadratic model introduced in the previous 
subsections. 


In addition to the previous assumptions, we impose the following conditions on the components $\partial_\eta\hat\Gamma_{1,ml}$ of $\partial_\eta\hat\Gamma_1$, where $m = 1,\dots,k$ and $l = 1,\dots,d$. First, we need the following deviation and boundedness condition: For each $m$ and $l$,
$$\|\partial_\eta\hat\Gamma_{1,ml}(\eta_0) - \partial_\eta\Gamma_{1,ml}(\eta_0)\|_\infty \lesssim_{P_n} 1, \qquad \|\partial_\eta\Gamma_{1,ml}(\eta_0)\|_\infty \lesssim 1. \tag{44}$$
Second, we require the sparse and pointwise norms of the following second-derivative matrices be stochastically bounded: For each $m$ and $l$,
$$\|\partial_\eta\partial_{\eta'}\hat\Gamma_{1,ml}\|_{\mathrm{sp}(\ell_n s)} + \|\partial_\eta\partial_{\eta'}\hat\Gamma_{1,ml}\|_{\mathrm{pw}(a)} \lesssim_{P_n} 1, \tag{45}$$
where $a \ne 0$ is any deterministic vector. Both of these conditions are mild. They can be verified using self-normalized moderate deviation theorems and by using laws of large numbers for matrices as discussed in the previous subsection.


Lemma 5 (Elementary Adaptivity for Estimation via Approximate Sparsity). Consider a sequence $\{P_n\}$ for which the conditions of the previous lemma hold. In addition assume that the deviation bound (44) holds and the sparse norm and pointwise norms of the second derivatives matrices are stochastically bounded as in (45). Then the adaptivity condition (33) holds for the testing and estimation problem in the affine-quadratic model.

5. Analysis of the IV Model with Very Many Control and Instrumental Variables


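Note that in the following we write $w \perp v$ to denote $\mathrm{Cov}(w, v) = 0$.

Consider the linear instrumental variable model with response variable:
$$y_i = d_i'\alpha_0 + x_i'\beta_0 + \varepsilon_i, \qquad \mathrm{E}[\varepsilon_i] = 0, \qquad \varepsilon_i \perp (z_i, x_i), \tag{46}$$
where $y_i$ is the response variable, $d_i = (d_{ik})_{k=1}^{p^d}$ is a $p^d$-vector of endogenous variables, such that
$$d_{i1} = x_i'\gamma_{01} + z_i'\delta_{01} + u_{i1}, \qquad \mathrm{E}[u_{i1}] = 0, \qquad u_{i1} \perp (z_i, x_i),$$
$$\vdots$$
$$d_{ip^d} = x_i'\gamma_{0p^d} + z_i'\delta_{0p^d} + u_{ip^d}, \qquad \mathrm{E}[u_{ip^d}] = 0, \qquad u_{ip^d} \perp (z_i, x_i). \tag{47}$$
Here $x_i = (x_{ij})_{j=1}^{p^x}$ is a $p^x$-vector of exogenous control variables, including a constant, and $z_i = (z_{ij})_{j=1}^{p^z}$ is a $p^z$-vector of instrumental variables. We will have $n$ i.i.d. draws of $w_i = (y_i, d_i', x_i', z_i')'$ obeying this system of equations. We also assume that $\mathrm{Var}(w_i)$ is finite throughout so that the model is well defined.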


The parameter value $\alpha_0$ is our target. We allow $p^x = p_n^x \gg n$ and $p^z = p_n^z \gg n$, but we maintain that $p^d$ is fixed in our analysis. This model includes the case of many instruments and small number of controls considered by Belloni et al. (2012) as a special case, and the analysis readily accommodates the case of many controls and no instruments - i.e. the linear regression model - considered by Belloni et al. (2010a); Belloni, Chernozhukov and Hansen (2014) and Zhang and Zhang (2014). For the latter, we simply set $p_n^z = 0$ and impose the additional condition $\varepsilon_i \perp u_i$ for $u_i = (u_{ij})_{j=1}^{p^d}$, which together with $\varepsilon_i \perp x_i$ implies that $\varepsilon_i \perp d_i$. We also note that the condition $\varepsilon_i \perp (x_i, z_i)$ is weaker than the condition $\mathrm{E}[\varepsilon_i \mid x_i, z_i] = 0$, which allows for some misspecification of the model.


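We may have that $z_i$ and $x_i$ are correlated so that $z_i$ are valid instruments only after controlling for $x_i$; specifically, we let $z_i = \Pi x_i + \zeta_i$, for $\Pi$ a $p_n^z \times p_n^x$ matrix and $\zeta_i$ a $p_n^z$-vector of unobservables with $x_i \perp \zeta_i$. Substituting this expression for $z_i$ as a function of $x_i$ into (46) gives a system for $y_i$ and $d_i$ that depends only on $x_i$:
$$y_i = x_i'\theta_0 + \rho_i^y, \qquad \mathrm{E}[\rho_i^y] = 0, \qquad \rho_i^y \perp x_i,$$
$$d_{i1} = x_i'\vartheta_{01} + \rho_{i1}^d, \qquad \mathrm{E}[\rho_{i1}^d] = 0, \qquad \rho_{i1}^d \perp x_i,$$
$$\vdots$$
$$d_{ip^d} = x_i'\vartheta_{0p^d} + \rho_{ip^d}^d, \qquad \mathrm{E}[\rho_{ip^d}^d] = 0, \qquad \rho_{ip^d}^d \perp x_i. \tag{48}$$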


Because the dimension $p = p_n$ of
$$\eta_0 = (\theta_0', (\vartheta_{0k}', \gamma_{0k}', \delta_{0k}')_{k=1}^{p^d})'$$
may be larger than $n$, informative estimation and inference about $\alpha_0$ is impossible without imposing restrictions on $\eta_0$.

To state our assumptions, we fix a collection of positive constants $(a, A, c, C)$, where $a > 1$, and a sequence of constants $\delta_n \searrow 0$ and $\ell_n \nearrow \infty$. These constants will not vary with $P$, but rather we will work with collections of $P$ defined by these constants.

Condition AS.1 We assume that $\eta_0$ is approximately sparse, namely that the decreasing rearrangement $(|\eta_0|_j^*)_{j=1}^p$ of absolute values of coefficients $(|\eta_{0j}|)_{j=1}^p$ obeys
$$|\eta_0|_j^* \le A j^{-a}, \qquad a > 1, \qquad j = 1,\dots,p. \tag{49}$$
Given this assumption we can decompose $\eta_0$ into a sparse component $\eta_0^m$ and small non-sparse component $\eta_0^r$:
$$\eta_0 = \eta_0^m + \eta_0^r, \qquad \mathrm{support}(\eta_0^m)\cap\mathrm{support}(\eta_0^r) = \emptyset,$$
$$\|\eta_0^m\|_0 \le s, \qquad \|\eta_0^r\|_2 \le c\sqrt{s/n}, \qquad \|\eta_0^r\|_1 \le c\sqrt{s^2/n}, \tag{50}$$
$$s = c\,n^{\frac{1}{2a}},$$
where the constant $c$ depends only on $(a, A)$.
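To see how polynomial decay produces the sparse-plus-small decomposition, the sketch below takes coefficients with $|\eta_{0j}| = A j^{-a}$ exactly, keeps the $s = c\,n^{1/(2a)}$ largest entries as the sparse part, and compares the $\ell_2$ and $\ell_1$ norms of the remainder with $\sqrt{s/n}$ and $\sqrt{s^2/n}$. The values $A = 1$, $a = 2$, $c = 1$, $p = 5000$ and $n = 10000$ are illustrative assumptions, and the function name is hypothetical.

```python
import math


def split_polynomial_decay(A, a, p, n, c=1.0):
    """Coefficients |eta_j| = A * j**(-a), j = 1..p, split into the s largest
    entries (sparse part) and the remainder, with s = c * n**(1/(2a))."""
    s = max(1, round(c * n ** (1.0 / (2.0 * a))))
    eta = [A * j ** (-a) for j in range(1, p + 1)]
    remainder = eta[s:]                      # entries j = s+1, ..., p
    l2 = math.sqrt(sum(v * v for v in remainder))
    l1 = sum(remainder)
    return s, l2, l1


N_OBS = 10_000
s, l2, l1 = split_polynomial_decay(A=1.0, a=2.0, p=5000, n=N_OBS)
l2_ratio = l2 / math.sqrt(s / N_OBS)         # remainder l2 norm vs sqrt(s/n)
l1_ratio = l1 / math.sqrt(s * s / N_OBS)     # remainder l1 norm vs sqrt(s^2/n)
```

In this configuration both ratios stay below one, so the remainder meets the two norm bounds with $c = 1$ even though no coefficient is exactly zero.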

Condition AS.2 We assume that
$$s^2\log(pn)^2/n \le o(1). \tag{51}$$


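We shall perform inference on $\alpha_0$ using the empirical analog of theoretical equations:
$$M(\alpha_0,\eta_0) = 0, \qquad M(\alpha,\eta) := \mathrm{E}[\psi(w_i,\alpha,\eta)], \tag{52}$$
where $\psi = (\psi_k)_{k=1}^{p^d}$ is defined by
$$\psi_k(w_i,\alpha,\eta) := \Big(y_i - x_i'\theta - \sum_{\bar k=1}^{p^d}(d_{i\bar k} - x_i'\vartheta_{\bar k})\alpha_{\bar k}\Big)(x_i'\gamma_k + z_i'\delta_k - x_i'\vartheta_k).$$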


We can verify that the following orthogonality condition holds:
$$\partial_{\eta'}M(\alpha_0,\eta)\Big|_{\eta=\eta_0} = 0. \tag{53}$$
This means that missing the true value $\eta_0$ by a small amount does not invalidate the moment condition. Therefore, the moment condition will be relatively insensitive to non-regular estimation of $\eta_0$.
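A sketch of why the orthogonality condition holds, written for the score $\psi_k(w_i,\alpha,\eta) = (y_i - x_i'\theta - \sum_{\bar k}(d_{i\bar k} - x_i'\vartheta_{\bar k})\alpha_{\bar k})(x_i'\gamma_k + z_i'\delta_k - x_i'\vartheta_k)$ and using only the zero-covariance conditions of the model together with $z_i = \Pi x_i + \zeta_i$. The shorthand $e_i(\eta)$ and $v_{ik}(\eta)$ for the two factors is introduced here for illustration, and the sketch additionally assumes $\mathrm{E}[\zeta_i] = 0$ and that $\mathrm{E}[x_i x_i']$ is nonsingular, so that the projection coefficients are unique.

```latex
\begin{align*}
e_i(\eta) &:= y_i - x_i'\theta - \sum_{\bar k}(d_{i\bar k} - x_i'\vartheta_{\bar k})\alpha_{0\bar k},
\qquad
v_{ik}(\eta) := x_i'\gamma_k + z_i'\delta_k - x_i'\vartheta_k,
\qquad
M_k(\alpha_0,\eta) = \mathrm{E}[e_i(\eta)\,v_{ik}(\eta)]. \\[4pt]
\text{At } \eta_0:\quad
\vartheta_{0k} &= \gamma_{0k} + \Pi'\delta_{0k},
\qquad
\theta_0 = \beta_0 + \sum_{\bar k}\vartheta_{0\bar k}\alpha_{0\bar k},
\qquad
e_i(\eta_0) = \varepsilon_i,
\qquad
v_{ik}(\eta_0) = \zeta_i'\delta_{0k}. \\[4pt]
\partial_{\theta} M_k(\alpha_0,\eta_0) &= -\mathrm{E}[x_i\, v_{ik}(\eta_0)] = -\mathrm{E}[x_i\zeta_i']\,\delta_{0k} = 0, \\
\partial_{\vartheta_{\bar k}} \mathrm{E}[e_i(\eta)\,v_{ik}(\eta_0)]\big|_{\eta_0} &= \alpha_{0\bar k}\,\mathrm{E}[x_i\zeta_i']\,\delta_{0k} = 0, \\
\partial_{\gamma_k} M_k(\alpha_0,\eta_0) &= \mathrm{E}[\varepsilon_i x_i] = 0,
\qquad
\partial_{\delta_k} M_k(\alpha_0,\eta_0) = \mathrm{E}[\varepsilon_i z_i] = 0, \\
\partial_{\vartheta_k} \mathrm{E}[e_i(\eta_0)\,v_{ik}(\eta)]\big|_{\eta_0} &= -\mathrm{E}[\varepsilon_i x_i] = 0.
\end{align*}
```

Derivatives through the first factor vanish because the instrument $\zeta_i'\delta_{0k}$ is uncorrelated with the controls, and derivatives through the second factor vanish because the structural error $\varepsilon_i$ is uncorrelated with $(x_i, z_i)$. When $\bar k = k$, the derivative with respect to $\vartheta_k$ is the sum of the two zero terms above.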


We denote the empirical analog of (52) as
$$\hat M(\hat\alpha,\hat\eta) = 0, \qquad \hat M(\alpha,\eta) := \mathbb{E}_n[\psi(w_i,\alpha,\eta)]. \tag{54}$$
Inference based on this condition can be shown to be immunized against small selection mistakes by virtue of orthogonality.


The above formulation is a special case of the linear-affine model. Indeed, here we have
$$M(\alpha,\eta) = \Gamma_1(\eta)\alpha + \Gamma_2(\eta), \qquad \hat M(\alpha,\eta) = \hat\Gamma_1(\eta)\alpha + \hat\Gamma_2(\eta),$$
$$\Gamma_1(\eta) = \mathrm{E}[\psi^a(w_i,\eta)], \qquad \hat\Gamma_1(\eta) = \mathbb{E}_n[\psi^a(w_i,\eta)],$$
$$\Gamma_2(\eta) = \mathrm{E}[\psi^b(w_i,\eta)], \qquad \hat\Gamma_2(\eta) = \mathbb{E}_n[\psi^b(w_i,\eta)],$$
where
$$\psi^a_{k,\bar k}(w_i,\eta) = -(d_{i\bar k} - x_i'\vartheta_{\bar k})(x_i'\gamma_k + z_i'\delta_k - x_i'\vartheta_k),$$
$$\psi^b_k(w_i,\eta) = (y_i - x_i'\theta)(x_i'\gamma_k + z_i'\delta_k - x_i'\vartheta_k).$$

Consequently we can use the results of the previous section. In order to do so we need to provide a suitable estimator for $\eta_0$. Here we use the Lasso and Post-Lasso estimators, as defined in Belloni et al. (2012), to deal with non-normal errors and heteroscedasticity.


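Algorithm 1 (Estimation of $\eta_0$). (1) For each $k$, do Lasso or Post-Lasso Regression of $d_{ik}$ on $x_i, z_i$ to obtain $\hat\gamma_k$ and $\hat\delta_k$. (2) Do Lasso or Post-Lasso Regression of $y_i$ on $x_i$ to get $\hat\theta$. (3) Do Lasso or Post-Lasso Regression of $\hat d_{ik} = x_i'\hat\gamma_k + z_i'\hat\delta_k$ on $x_i$ to get $\hat\vartheta_k$. The estimator of $\eta_0$ is given by $\hat\eta = (\hat\theta', (\hat\vartheta_k', \hat\gamma_k', \hat\delta_k')_{k=1}^{p^d})'$.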
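The three regression steps can be sketched in a few lines for the case of a single endogenous variable. This is a minimal illustration under stated assumptions, not the implementation used in the paper: it uses a plain Lasso with one fixed penalty `lam` solved by coordinate descent, rather than the data-dependent penalty and loadings of Belloni et al. (2012), and it omits the Post-Lasso refit. All function names are hypothetical. The final function solves the empirical orthogonal moment equation for a scalar $\alpha$.

```python
import random


def soft_threshold(v, t):
    """Soft-thresholding operator used in the Lasso coordinate update."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def lasso(X, y, lam, sweeps=200):
    """Minimize (1/(2n))*||y - Xb||^2 + lam*||b||_1 by cyclic coordinate
    descent. X is a list of rows; lam = 0 gives least squares."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    resid = list(y)                                   # y - Xb for current b
    scale = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            if scale[j] == 0.0:
                continue
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + scale[j] * b[j]
            new = soft_threshold(rho, lam) / scale[j]
            step = new - b[j]
            if step != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * step
                b[j] = new
    return b


def predict(X, b):
    return [sum(x * c for x, c in zip(row, b)) for row in X]


def algorithm1(y, d, X, Z, lam):
    """Steps (1)-(3) for a single endogenous variable d."""
    px = len(X[0])
    XZ = [xr + zr for xr, zr in zip(X, Z)]
    coef = lasso(XZ, d, lam)                          # step (1): d on (x, z)
    gamma, delta = coef[:px], coef[px:]
    theta = lasso(X, y, lam)                          # step (2): y on x
    vartheta = lasso(X, predict(XZ, coef), lam)       # step (3): fitted d on x
    return theta, vartheta, gamma, delta


def alpha_hat(y, d, X, Z, eta):
    """Solve E_n[(y - x'theta - (d - x'vartheta) a) * v] = 0 for scalar a,
    where v = x'gamma + z'delta - x'vartheta."""
    theta, vartheta, gamma, delta = eta
    x_theta, x_vartheta = predict(X, theta), predict(X, vartheta)
    d_fit = [a + b for a, b in zip(predict(X, gamma), predict(Z, delta))]
    v = [f - xv for f, xv in zip(d_fit, x_vartheta)]
    num = sum((yi - xt) * vi for yi, xt, vi in zip(y, x_theta, v))
    den = sum((di - xv) * vi for di, xv, vi in zip(d, x_vartheta, v))
    return num / den


# Noise-free synthetic data with alpha_0 = 1 (illustrative only).
random.seed(0)
n = 60
X = [[1.0, random.gauss(0, 1), random.gauss(0, 1)] for _ in range(n)]
Z = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(n)]
d = [0.5 * x[1] + z[0] for x, z in zip(X, Z)]
y = [1.0 * di + x[0] + 2.0 * x[2] for di, x in zip(d, X)]
eta_hat = algorithm1(y, d, X, Z, lam=0.0)
alpha_check = alpha_hat(y, d, X, Z, eta_hat)
```

The usage example sets `lam=0.0` and uses noise-free data so that every step reduces to least squares and `alpha_check` recovers the true coefficient up to numerical error. With noisy, high-dimensional data one would use a positive penalty, and the estimate would then be close to but not equal to the true value.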


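We then use
$$\hat\Omega(\alpha,\hat\eta) = \mathbb{E}_n[\psi(w_i,\alpha,\hat\eta)\psi(w_i,\alpha,\hat\eta)']$$
to estimate the variance matrix $\Omega(\alpha,\eta_0) = \mathrm{E}[\psi(w_i,\alpha,\eta_0)\psi(w_i,\alpha,\eta_0)']$. We formulate the orthogonal score statistic and the $C(\alpha)$-statistic,
$$S(\alpha) := \hat\Omega^{-1/2}(\alpha,\hat\eta)\sqrt{n}\hat M(\alpha,\hat\eta), \qquad C(\alpha) = \|S(\alpha)\|^2, \tag{55}$$
as well as our estimator $\hat\alpha$:
$$\hat\alpha = \arg\min_{\alpha\in\mathcal{A}}\|\sqrt{n}\hat M(\alpha,\hat\eta)\|^2.$$
Note also that $\hat\alpha = \arg\min_{\alpha\in\mathcal{A}} C(\alpha)$ under mild conditions, since we work with "exactly identified" systems of equations. We also need to specify a variance estimator $\hat V_n$ for the large sample variance $V_n$ of $\hat\alpha$. We set $\hat V_n = (\hat\Gamma_1(\hat\eta)')^{-1}\hat\Omega(\hat\alpha,\hat\eta)(\hat\Gamma_1(\hat\eta))^{-1}$.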
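For a single endogenous variable, the $C(\alpha)$-statistic and the confidence region obtained by inverting it can be sketched as follows. The inputs are assumed to be already partialled out: `y_res` stands for $y_i - x_i'\hat\theta$, `d_res` for $d_i - x_i'\hat\vartheta$, and `v` for the estimated instrument $x_i'\hat\gamma + z_i'\hat\delta - x_i'\hat\vartheta$. The data below are made up for illustration, the function names are hypothetical, and the grid search is one simple way to invert the test.

```python
def c_alpha(alpha, y_res, d_res, v):
    """C(alpha) = n * M_hat(alpha)^2 / Omega_hat(alpha) for scalar alpha,
    with psi_i = (y_res_i - d_res_i * alpha) * v_i and Omega_hat = E_n[psi^2]."""
    n = len(v)
    psi = [(yr - dr * alpha) * vi for yr, dr, vi in zip(y_res, d_res, v)]
    m_hat = sum(psi) / n
    omega_hat = sum(s * s for s in psi) / n
    return n * m_hat * m_hat / omega_hat


def confidence_region(grid, y_res, d_res, v, critical_value=3.841):
    """Grid points not rejected by the C(alpha) test; 3.841 is approximately
    the 0.95 quantile of the chi-square distribution with 1 degree of freedom."""
    return [a for a in grid if c_alpha(a, y_res, d_res, v) <= critical_value]


# Hypothetical partialled-out data, n = 40, true alpha = 2.
v = [1.0, -1.0, 2.0, -2.0] * 10
noise = [0.1, 0.1, -0.05, -0.05] * 10
d_res = list(v)
y_res = [2.0 * dr + e for dr, e in zip(d_res, noise)]
grid = [1.0 + 0.05 * k for k in range(41)]      # 1.00, 1.05, ..., 3.00
region = confidence_region(grid, y_res, d_res, v)
```

The region collects the values of $\alpha$ at which the orthogonal score is statistically indistinguishable from zero; in this small noise design it shrinks to the grid point at the true value.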


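To estimate the nuisance parameter we impose the following condition. Let $f_i := (f_{ij})_{j=1}^{p^f} := (x_i', z_i')'$; $h_i := (h_{il})_{l=1}^{p^h} := (y_i, d_i', \bar d_i')'$ where $\bar d_i = (\bar d_{ik})_{k=1}^{p^d}$ and $\bar d_{ik} := x_i'\gamma_{0k} + z_i'\delta_{0k}$; $v_i = (v_{il})_{l=1}^{p^v} := (\varepsilon_i, \rho_i^y, \rho_i^{d\prime}, \varrho_i')'$ where $\varrho_i = (\varrho_{ik})_{k=1}^{p^d}$ and $\varrho_{ik} := d_{ik} - \bar d_{ik}$. Let $\tilde h_i := h_i - \mathrm{E}[h_i]$.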


Condition RF. (i) The eigenvalues o/E[/j/(] are bounded from above by C and from 
below by c. For all j and I, (ii) E[h^i] + E[|4^/i^;|] l/Elffjvfi] < C and E\\f‘l^v‘fi\] < 
E[\ff-hffj\], (in) E[\ff-vfi\f‘\o^{pn)/n < 5n, and (iv) s\og{pn)/n < 6n- With probability 
no less than 1 — 6n, we have that (v) maxj<n,j log(pn)]/n < 6n and max;j |(En — 
E)[f!,vl]\ + |(E„ - E)[ffhl]\ < 6n and (vi) ||E„[/,/'] - E[/,/']< 6n- 

The conditions are motivated by those given in Belloni et al. (2012). The current 
conditions are made slightly stronger to account for the fact that we use zero covariance 
conditions in formulating the moments. Some conditions could be easily relaxed at a 
cost of more complicated exposition. 


To estimate the variance matrix and establish asymptotic normality, we also need the 
following condition. Let q > 4 be a fixed constant. 

Condition SM. For each $l$ and $k$, (i) $\mathrm{E}[|h_{il}|^q] + \mathrm{E}[|v_{il}|^q] \le C$, (ii) $c \le \mathrm{E}[\varepsilon_i^2 \mid x_i, z_i] \le C$, $c \le \mathrm{E}[\varrho_{ik}^2 \mid x_i, z_i] \le C$ a.s., (iii) $\sup_{\alpha\in\mathcal{A}}\|\alpha\|_2 \le C$.

Under the conditions set forth above, we have the following result on validity of post-selection and post-regularization inference using the $C(\alpha)$-statistic and estimators derived from it.

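Proposition 5 (Valid Inference in Large Linear Models using $C(\alpha)$-statistics). Let $\mathcal{P}_n$ be the collection of all $P$ such that Conditions AS.1-2, SM, and RF hold for the given $n$. Then uniformly in $P \in \mathcal{P}_n$, $S(\alpha_0) \rightsquigarrow N(0, I)$ and $C(\alpha_0) \rightsquigarrow \chi^2(p^d)$. As a consequence, the confidence set $\mathrm{CR}_{1-a} = \{\alpha\in\mathcal{A} : C(\alpha) \le c(1-a)\}$, where $c(1-a)$ is the $(1-a)$-quantile of a $\chi^2(p^d)$, is uniformly valid for $\alpha_0$, in the sense that
$$\lim_{n\to\infty}\sup_{P\in\mathcal{P}_n}|P(\alpha_0\in\mathrm{CR}_{1-a}) - (1-a)| = 0.$$
Furthermore, for $V_n = (\Gamma_1')^{-1}\Omega(\alpha_0,\eta_0)(\Gamma_1)^{-1}$, we have that
$$\lim_{n\to\infty}\sup_{P\in\mathcal{P}_n}\sup_{R\in\mathcal{R}}|P(V_n^{-1/2}\sqrt{n}(\hat\alpha-\alpha_0)\in R) - P(N(0,I)\in R)| = 0,$$
where $\mathcal{R}$ is the collection of all convex sets. Moreover, the result continues to apply if $V_n$ is replaced by $\hat V_n$. Thus, $\mathrm{CR}^l_{1-a} = [l'\hat\alpha \pm c(1-a/2)(l'\hat V_n l/n)^{1/2}]$, where $c(1-a/2)$ is the $(1-a/2)$-quantile of a $N(0,1)$, provides a uniformly valid confidence set for $l'\alpha_0$:
$$\lim_{n\to\infty}\sup_{P\in\mathcal{P}_n}|P(l'\alpha_0\in\mathrm{CR}^l_{1-a}) - (1-a)| = 0.$$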


5.1. Simulation Illustration. In this section, we provide results from a small Monte Carlo simulation to illustrate the performance of the estimator resulting from the application of Algorithm 1 in a small sample setting. As a comparison, we report results from two commonly used "unprincipled" alternatives for which uniformly valid inference over the class of approximately sparse models does not hold. Simulation parameters were chosen so that approximate sparsity holds but exact sparsity is violated in such a way that we expected the unprincipled procedures to perform poorly.


For our simulation, we generate data as $n$ iid draws from the model
$$y_i = \alpha d_i + x_i'\beta + 2\varepsilon_i,$$
$$d_i = x_i'\gamma + z_i'\delta + u_i,$$
$$z_i = \Pi x_i + .125\,\zeta_i,$$
where $\Sigma$ is a $p_n^x \times p_n^x$ matrix with $\Sigma_{kj} = (0.5)^{|j-k|}$ and $I_{p_n^z}$ is a $p_n^z \times p_n^z$ identity matrix. We set the number of potential controls variables ($p_n^x$) to 200, the number of instruments ($p_n^z$) to 150, and the number of observations ($n$) to 200. For model coefficients, we set $\alpha = 0$, $\beta = \gamma$ as $p_n^x$-vectors with entries $\beta_j = \gamma_j = 1/(9\nu)$, $\nu = 4/9 + \sum_{j=5}^{p_n^x} 1/j^2$ for $j \le 4$ and $\beta_j = \gamma_j = 1/(j^2\nu)$ for $j > 4$, $\delta$ as a $p_n^z$-vector with entries $\delta_j = 3/j^2$, and $\Pi = [I_{p_n^z}, 0_{p_n^z\times(p_n^x - p_n^z)}]$. We report results based on 1000 simulation replications.


We provide results for four different estimators - an infeasible Oracle estimator that knows the nuisance parameters $\eta$ (Oracle), two naive estimators, and the proposed "Double-Selection" estimator. The results for the proposed "Double-Selection" procedure are obtained following Algorithm 1 using Post-Lasso at every step. To obtain the Oracle results, we run standard IV regression of $y_i - \mathrm{E}[y_i \mid x_i]$ on $d_i - \mathrm{E}[d_i \mid x_i]$ using the single instrument $\zeta_i'\delta$. The expected values are obtained from the model above and $\zeta_i'\delta$ provides the information in the instruments that is unrelated to the controls.


The two naive alternatives offer unprincipled, although potentially intuitive alternatives. The first naive estimator follows Algorithm 1 but replaces Lasso/Post-Lasso with stepwise regression with a p-value for entry of .05 and a p-value for removal of .10 (Stepwise). The second naive estimator (Non-orthogonal) corresponds to using a moment condition that does not satisfy the orthogonality condition described previously but will produce valid inference when perfect model selection in the regression of $d$ on $x$ and $z$ is possible or perfect model selection in the regression of $y$ on $x$ is possible and an instrument is selected in the $d$ on $x$ and $z$ regression (see the footnote on the Non-orthogonal procedure).


All of the Lasso and Post-Lasso estimates are obtained using the data-dependent penalty level from Belloni and Chernozhukov (2013). This penalty level depends on a standard deviation that is estimated adapting the iterative algorithm described in Belloni et al. (2012) Appendix A using Post-Lasso at each iteration. For inference in all cases, we use standard t-tests based on conventional homoscedastic IV standard errors obtained from the final IV step performed in each strategy.


We display the simulation results in Figure 1, and we report the median bias (Bias), median absolute deviation (MAD), and size of 5% level tests (Size) for each procedure in Table 1. For each estimator, we plot the simulation estimate of the sampling distribution of the estimator centered around the true parameter and scaled by the estimated standard error. With this standardization, usual asymptotic approximations would suggest that these curves should line up with a $N(0,1)$ density function which is displayed as the bold solid line in the figure. We can see that the Oracle estimator and the Double-Selection estimator are centered correctly and line up reasonably well with the $N(0,1)$, although both estimators exhibit some mild skewness. It is interesting that the sampling distributions of the Oracle and Double-Selection estimators are very similar as predicted by the theory. In contrast, both of the naive estimators are centered far from zero, and it is clear that the asymptotic approximation provides a very poor guide to the finite sample distribution of these estimators in the design considered.
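The three summary measures can be computed from the simulation replications along the following lines. This is a sketch under stated assumptions: MAD is taken here as the median of $|\hat\alpha - \alpha|$ across replications (it could alternatively be measured around the median estimate), size is the rejection frequency of the two-sided t-test at the nominal 5% level with critical value 1.96, and the replication values below are made up.

```python
import statistics


def summarize(estimates, std_errors, alpha_true, critical_value=1.96):
    """Median bias, median absolute deviation from the true value, and the
    rejection frequency of |t| > critical_value across replications."""
    errors = [a - alpha_true for a in estimates]
    bias = statistics.median(errors)
    mad = statistics.median(abs(e) for e in errors)
    rejections = sum(
        abs(e / se) > critical_value for e, se in zip(errors, std_errors)
    )
    return {"Bias": bias, "MAD": mad, "Size": rejections / len(estimates)}


# Hypothetical output of five replications (illustrative numbers only).
estimates = [0.1, -0.2, 0.05, 0.4, -0.05]
std_errors = [0.1, 0.1, 0.1, 0.1, 0.1]
summary = summarize(estimates, std_errors, alpha_true=0.0)
```

A well-behaved procedure should show a median bias near zero and a size near 0.05 over many replications.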


The poor inferential performance of the two naive estimators is driven by different phenomena. The unprincipled use of stepwise regression fails to control spurious inclusion of irrelevant variables, which leads to inclusion of many essentially irrelevant variables, resulting in many-instrument-type problems (e.g. Chao et al. (2012)). In addition, the spuriously included variables are those most highly correlated to the noise within sample, which adds an additional type of "endogeneity bias". The failure of the "Non-orthogonal" method is driven by the fact that perfect model selection is not possible within the present design: Here we have model selection mistakes in which the control variables that are correlated to the instruments but only moderately correlated to the outcome and endogenous variable are missed. Such exclusions result in standard omitted variables bias in the estimator for the parameter of interest and substantial size distortions. The additional step in the Double-Selection procedure can be viewed as a way to guard against such mistakes. Overall, the results illustrate the uniformity claims made in the preceding section. The feasible Double-Selection procedure following from Algorithm 1 performs similarly to the semi-parametrically efficient infeasible Oracle. We obtain good inferential properties with the asymptotic approximation providing a fairly good guide to the behavior of the estimator despite working in a setting in which perfect model selection is impossible. Although simply illustrative of the theory, the results are reassuring and in line with extensive simulations in the linear model with many controls provided in Belloni, Chernozhukov and Hansen (2014), in the instrumental variables model with many instruments and a small number of controls provided in Belloni et al. (2012), and in linear panel data models provided in Belloni, Chernozhukov, Hansen and Kozbur (2014).

Footnote: Specifically, for the second naive alternative (Non-orthogonal), we first do Lasso regression of $d$ on $x$ and $z$ to obtain Lasso estimates of the coefficients $\gamma$ and $\delta$. Denote these estimates as $\hat\gamma_L$ and $\hat\delta_L$, and denote the indices of the coefficients estimated to be non-zero as $\hat I_x^d = \{j : \hat\gamma_{Lj} \ne 0\}$ and $\hat I_z^d = \{j : \hat\delta_{Lj} \ne 0\}$. We then run Lasso regression of $y$ on $x$ to learn the identities of controls that predict the outcome. We denote the Lasso estimates as $\hat\theta_L$ and keep track of the indices of the coefficients estimated to be non-zero as $\hat I_x^y = \{j : \hat\theta_{Lj} \ne 0\}$. We then take the union of the controls selected in either step, $\hat I_x = \hat I_x^y \cup \hat I_x^d$. The estimator of $\alpha$ is then obtained as the usual 2SLS estimator of $y_i$ on $d_i$ using all selected elements from $x_i$, $x_{ij}$ such that $j \in \hat I_x$, as controls and the selected elements from $z_i$, $z_{ij}$ such that $j \in \hat I_z^d$, as instruments.

[Figure 1 panels recovered from the extraction: Oracle, Stepwise; horizontal axis from -5 to 5.]

Figure 1. The figure presents the histogram of the estimator from each method centered around the true parameters and scaled by the estimated standard error from the simulation experiment. The red curve is the pdf of a standard normal which will correspond to the sampling distribution of the estimator under the asymptotic approximation. Each panel is labeled with the corresponding estimator from the simulation.

Table 1. Summary of Simulation Results for the Estimation of $\alpha$

Method              Bias     MAD      Size
Oracle              0.015    0.247    0.043
Stepwise            0.282    0.368    0.261
Non-orthogonal      0.084    0.112    0.189
Double-Selection    0.069    0.243    0.053

This table summarizes the simulation results from a linear IV model with many instruments and controls. Estimators include an infeasible oracle as a benchmark (Oracle), two naive alternatives (Stepwise and Non-orthogonal) described in the text, and our proposed feasible valid procedure (Double-Selection). Median bias (Bias), median absolute deviation (MAD), and size for 5% level tests (Size) are reported.


5.2. Empirical Illustration: Logit Demand Estimation. As further illustration of the approach, we provide a brief empirical example in which we estimate the coefficients in a simple logit model of demand for automobiles using market share data. Our example is based on the data and most basic strategy from Berry et al. (1995). Specifically, we estimate the parameters from the model
$$\log(s_{it}) - \log(s_{0t}) = \alpha_0 p_{it} + x_{it}'\beta_0 + \varepsilon_{it},$$
$$p_{it} = z_{it}'\delta_0 + x_{it}'\gamma_0 + u_{it},$$
where $s_{it}$ is the market share of product $i$ in market $t$ with product zero denoting the outside option, $p_{it}$ is price and is treated as endogenous, $x_{it}$ are observed included product characteristics, and $z_{it}$ are instruments. One could also adapt the proposed variable selection procedures to extensions of this model such as the nested logit model or models allowing for random coefficients; see, e.g., Gillen et al. (2014) for an example with a random coefficient.

In our example, we use the same set of product characteristics (x-variables) as used in obtaining the basic results in Berry et al. (1995). Specifically, we use five variables in $x_{it}$: a constant, an air conditioning dummy, horsepower divided by weight, miles per dollar, and vehicle size. We refer to these five variables as the baseline set of controls.


We also adopt the argument from Berry et al. (1995) to form our potential instruments. Berry et al. (1995) argue that characteristics of other products will satisfy an exclusion restriction, $\mathrm{E}[\varepsilon_{it} \mid x_{j\tau}] = 0$ for any $\tau$ and $j \ne i$, and thus that any function of characteristics of other products may be used as instrument for price. This condition leaves a very high-dimensional set of potential instruments as any combination of functions of $\{x_{j\tau}\}_{j\ne i,\tau\ge 1}$ may be used to instrument for $p_{it}$. To reduce the dimensionality, Berry et al. (1995) use intuition and an exchangeability argument to motivate consideration of a small number of these potential instruments formed by taking sums of product characteristics formed by summing over products excluding product $i$. Specifically, we form baseline instruments by taking
$$z_{k,it} = \Big(\sum_{r\ne i,\, r\in\mathcal{I}_f} x_{k,rt},\ \sum_{r\ne i,\, r\notin\mathcal{I}_f} x_{k,rt}\Big),$$
where $x_{k,it}$ is the $k$-th element of vector $x_{it}$ and $\mathcal{I}_f$ denotes the set of products produced by firm $f$. This choice yields a vector $z_{it}$ consisting of 10 instruments. We refer to this set of instruments as the baseline instruments.
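The construction of the baseline instruments can be sketched for a single market as follows. For each product and each characteristic, one instrument sums that characteristic over the other products of the same firm and a second sums it over the products of rival firms, so five characteristics give 10 instruments. The data layout (a list of dicts with keys `firm` and `x`) and the function name are hypothetical choices made for this illustration.

```python
def baseline_instruments(products):
    """products: list of dicts with keys 'firm' and 'x' (list of product
    characteristics), all from one market. Returns, for each product, the
    own-firm sums over its other products followed by the rival-firm sums."""
    k = len(products[0]["x"])
    instruments = []
    for i, prod in enumerate(products):
        own = [0.0] * k
        rival = [0.0] * k
        for r, other in enumerate(products):
            if r == i:
                continue                      # exclude product i itself
            target = own if other["firm"] == prod["firm"] else rival
            for j in range(k):
                target[j] += other["x"][j]
        instruments.append(own + rival)
    return instruments


# Hypothetical market with three products and two characteristics each.
market = [
    {"firm": "A", "x": [1.0, 10.0]},
    {"firm": "A", "x": [1.0, 20.0]},
    {"firm": "B", "x": [1.0, 30.0]},
]
z = baseline_instruments(market)
```

In this toy market the first product of firm A receives the characteristics of its sibling product as own-firm instruments and those of firm B's product as rival instruments, while firm B's single product has zero own-firm sums.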

Although the choice of the baseline instruments and controls is motivated by good intuition and economic theory, we note that theory does not clearly state which product characteristics or instruments should be used in the model. Theory also fails to indicate the functional form with which any such variables should enter the model. The high-dimensional methods outlined in this paper offer one strategy to help address these concerns that complements the economic intuition motivating the baseline controls and instruments. As an illustration, we consider an expanded set of controls and instruments. We augment the set of potential controls with all first order interactions of the baseline variables, quadratics and cubics in all continuous baseline variables, and a time trend, which yields a total of 24 x-variables. We refer to these as the augmented controls. We then take sums of these characteristics as potential instruments following the original strategy, which yields 48 potential instruments.

We present estimation results in Table 2. We report results obtained by applying the 
method outlined in Algorithm 1 using just the baseline set of five product characteristics 
and 10 instruments in the row labeled “Baseline 2SLS with Selection” and results obtained 
by applying the method to the augmented set of 24 controls and 48 instruments 
in the row labeled “Augmented 2SLS with Selection.” In each case, we apply the method 
outlined in Algorithm 1 using post-Lasso in each step and forcing the intercept to be 
included in all models. We employ the heteroscedasticity robust version of Post-Lasso 
of Belloni et al. (2012) following the implementation algorithm provided in Appendix A 
of Belloni et al. (2012). For comparison, we also report OLS and 2SLS estimates using 
only the baseline variables in “Baseline OLS” and “Baseline 2SLS,” respectively; and 
we report OLS and 2SLS estimates using the augmented variable set in “Augmented 
OLS” and “Augmented 2SLS,” respectively. All standard errors are conventional 
heteroscedasticity robust standard errors. 

Considering first estimates of the price coefficient, we see that the estimated price 
coefficient increases in magnitude as we move from OLS to 2SLS and then to the selection 
based results. After selection using only the original variables, we estimate the price 
coefficient to be -.185 with an estimated standard error of .014 compared to an OLS 
estimate of -.089 with estimated standard error of .004 and 2SLS estimate of -.142 with 
estimated standard error of .012. In this case, all five controls are selected in the 
log-share on controls regression, all five controls but only four instruments are selected in the 
price on controls and instruments regression, and four of the controls are selected for the 
price on controls relationship. The difference between the baseline results is thus largely 
driven by the difference in instrument sets. The change in the estimated coefficient 
is consistent with the wisdom from the many-instrument literature that inclusion of 
irrelevant instruments biases 2SLS toward OLS. 


Table 2. Estimates of Price Coefficient 

                                Price Coefficient   Standard Error   Number Inelastic 
Estimates Without Selection 
  Baseline OLS                       -0.089             0.004              1502 
  Baseline 2SLS                      -0.142             0.012               670 
  Augmented OLS                      -0.099             0.005              1405 
  Augmented 2SLS                     -0.127             0.014               874 
2SLS Estimates With “Double Selection” 
  Baseline 2SLS Selection            -0.185             0.014               139 
  Augmented 2SLS Selection           -0.221             0.015                12 

This table reports estimates of the coefficient on price (“Price Coefficient”) along with the estimated 
standard error (“Standard Error”) obtained using different sets of controls and instruments. The rows 
“Baseline OLS” and “Baseline 2SLS” respectively provide OLS and 2SLS results using the baseline set 
of variables (5 controls and 10 instruments) described in the text. The rows “Augmented OLS” and 
“Augmented 2SLS” are defined similarly but use the augmented set of variables described in the text 
(24 controls and 48 instruments). The rows “Baseline 2SLS with Selection” and “Augmented 2SLS 
with Selection” apply the “double selection” approach developed in this paper to select a set of 
controls and instruments and perform valid post-selection inference about the estimated price 
coefficient where selection occurs considering only the baseline variables. For each procedure, we also 
report the point estimate of the number of products for which demand is estimated to be inelastic in 
the column “Number Inelastic.” 

With the larger set of variables, our post-model-selection estimator of the price coef¬ 
ficient is -.221 with an estimated standard error of .015 compared to the OLS estimate 
of -.099 with an estimated standard error of .005 and 2SLS estimate of -.127 with an 
estimated standard error of .014. Here, we see some evidence that the original set of con¬ 
trols may have been overly parsimonious as we select some terms that were not included 
in the baseline variable set. We also see a closer agreement between the OLS estimate 
and 2SLS estimate without selection which is likely driven by the larger number of in¬ 
struments considered and the usual bias towards OLS seen in 2SLS with many weak 
or irrelevant instruments. In the log-share on controls regression, we have eight control 
variables selected; and we have seven controls and only four instruments selected in the 
price on controls and instrument regression. We also have 13 variables selected for the 
price on controls relationship. The selection of these additional variables suggests that 
there is important nonlinearity missed by the baseline set of variables. 

The most interesting feature of the results is that estimates of own-price elasticities 
become more plausible as we move from the baseline results to the results based on 
variable selection with a large number of controls. Recall that facing inelastic demand 
is inconsistent with profit-maximizing price choice within the present context, so theory 
would predict that demand should be elastic for all products. However, the baseline 
point estimates imply inelastic demand for 670 products. When we use the larger set of 
instruments without selection, the number of products for which we estimate inelastic 
demand increases to 874 with the increase generated by the 2SLS coefficient estimate 
moving back towards the OLS estimate. The use of the variable selection results provides 
results closer to the theoretical prediction. The point estimates based on selection from 
only the baseline variables imply inelastic demand for 139 products, and we estimate 
inelastic demand for only 12 products using the results based on selection from the 
larger set of variables. Thus, the new methods provide the most reasonable estimates of 
own-price elasticities. 
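
The "Number Inelastic" count can be illustrated with a short sketch. It assumes the textbook simple-logit own-price elasticity, $\alpha\, p_{it} (1 - s_{it})$, where $\alpha$ is the price coefficient and $s_{it}$ the market share; the function name and inputs are hypothetical, and this is not claimed to reproduce the exact computation behind Table 2.

```python
import numpy as np


def count_inelastic(alpha, price, share):
    """Count products with inelastic demand in a simple logit model (sketch).

    Assumes the own-price elasticity alpha * price * (1 - share), which holds
    when utility is linear in price with coefficient alpha.
    Returns the count of products with |elasticity| < 1 and the elasticities.
    """
    price = np.asarray(price, dtype=float)
    share = np.asarray(share, dtype=float)
    elasticity = alpha * price * (1.0 - share)
    n_inelastic = int(np.sum(np.abs(elasticity) < 1.0))
    return n_inelastic, elasticity
```

Because the elasticity is proportional to the price coefficient, a coefficient that is larger in magnitude mechanically moves products from the inelastic to the elastic region, which matches the pattern across the rows of the table.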

We conclude by noting that the simple specification above suffers from the usual draw¬ 
backs of the logit demand model. However, the example illustrates how the application 
of the methods we outlined may be used in the estimation of structural parameters in 
economics and adds to the plausibility of the resulting estimates. In this example, we 
see that we obtain more sensible estimates of key parameters with at most a modest 
cost in increased estimation uncertainty after applying the methods in this paper while 
considering a flexible set of variables. 


6. Overview of Related Literature 


Inference following model selection or regularization more generally has been an active 
area of research in econometrics and statistics for the last several years. In this section, 
we provide a brief overview of this literature highlighting some key developments. This 
review is necessarily selective due to the large number of papers available and the rapid 
pace at which new papers are appearing. We choose to focus on papers that deal 
specifically with high-dimensional nuisance parameter settings, and note that the ideas 
in these papers apply in low dimensional settings as well. 


Early work on inference in high-dimensional settings focused on inference based on 
the so-called perfect recovery; see, e.g., Fan and Li (2001) for an early paper, Fan and Lv 
(2010) for a more recent review, and Bühlmann and van de Geer (2011) for a textbook 
treatment. A consequence of this property is that model selection does not impact the 
asymptotic distribution of the parameters estimated in the selected model. This feature 
allows one to do inference using standard approximate distributions for the parameters 
of the selected model ignoring that model selection was done. While convenient 
and fruitful in many applications (e.g. signal processing), such results effectively rely 
on strong conditions that imply that one will be able to perfectly select the correct 
model. For example, such results in linear models require the so-called “beta-min 
condition” (Bühlmann and van de Geer (2011)) that all but a small number of coefficients 
are exactly zero and the remaining non-zero coefficients are bounded away from zero, 
effectively ruling out variables that have small, non-zero coefficients. Such conditions 
seem implausible in many applications, especially in econometrics, and relying on such 
conditions produces asymptotic approximations that may provide very poor approximations 
to finite-sample distributions of estimators as they are not uniformly valid over 
sequences of models that include even minor deviations from conditions implying perfect 
model selection. The concern about the lack of uniform validity of inference based on 
oracle properties was raised in a series of papers, including Leeb and Pötscher (2008a) 
and Leeb and Pötscher (2008b) among many others, and the more recent work on 
post-model-selection inference has been focused on offering procedures that provide uniformly 
valid inference over interesting (large) classes of models that include cases where perfect 
model selection will not be possible. 

To our knowledge, the first work to formally and expressly address the problem of 
obtaining uniformly valid inference following model selection is Belloni et al. (ArXiv, 
2010b) which considered inference about parameters on a low-dimensional set of endogenous 
variables following selection of instruments from among a high-dimensional set of 
potential instruments in a homoscedastic, Gaussian instrumental variables (IV) model. 
The approach does not rely on implausible “beta-min” conditions which imply perfect 
model selection but instead relies on the fact that the moment condition underlying IV 
estimation satisfies the orthogonality condition and the use of high-quality variable 
selection methods. These ideas were further developed in the context of providing uniformly 
valid inference about the parameters on endogenous variables in the IV context 
with many instruments to allow non-Gaussian heteroscedastic disturbances in Belloni 
et al. (2012). These principles have also been applied in Belloni et al. (2010a), who 
developed approaches for regression and IV models with Gaussian errors; Belloni, 
Chernozhukov and Hansen (2014) (ArXiv 2011), which covers estimation of the parametric 
components of the partially linear model, estimation of average treatment effects, and 
provides a formal statement of the orthogonality condition; Farrell (2014) which 
covers average treatment effects with discrete, multi-valued treatments; Kozbur (2014) 
which covers additive nonparametric models; and Belloni, Chernozhukov, Hansen and 
Kozbur (2014) which extends the IV and partially linear model results to allow for fixed 
effects panel data and clustered dependence structures. The most recent, general approach 
is provided in Belloni, Chernozhukov, Fernandez-Val and Hansen (2013) where 
inference about parameters defined by a continuum of orthogonalized estimating equations 
with infinite-dimensional nuisance parameters is analyzed and positive results on 
inference are developed. The framework in Belloni, Chernozhukov, Fernandez-Val and 
Hansen (2013) is general enough to cover the aforementioned papers and many other 
parametric and semi-parametric models considered in economics. 

As noted above, providing uniformly valid inference following model selection is closely 
related to use of Neyman’s $C(\alpha)$-statistic. Valid confidence regions can be obtained by 
inverting tests based on these statistics, and minimizers of $C(\alpha)$-statistics may be used 
as point estimators. The use of $C(\alpha)$ statistics for testing and estimation in 
high-dimensional approximately sparse models was first explored in the context of 
high-dimensional quantile regression in Belloni, Chernozhukov and Kato (2013b) 
(Oberwolfach, 2012) and Belloni, Chernozhukov and Kato (2013a) and in the context of 
high-dimensional logistic regression and other high-dimensional generalized linear models by 
Belloni, Chernozhukov and Wei (2013). More recent uses of $C(\alpha)$-statistics (or close 
variants, under different names) include those in Voorman et al. (2014), Ning and Liu 
(2014), and Yang et al. (2014) among others. 


There have also been parallel developments based upon ex-post “de-biasing” of estimators. 
This approach is mathematically equivalent to doing classical “one-step” corrections 
in the general framework of Section 2. Indeed, while at first glance this “de-biasing” 
approach may appear distinct from that taken in the papers listed above in this section, 
it is the same as approximately solving - by doing one Gauss-Newton step - orthogonal 
estimating equations satisfying the orthogonality condition. The general results of Section 2 
suggest that these approaches - the exact solving and “one-step” solving - are generally 
first-order asymptotically equivalent, though higher-order differences may persist. To the best 
of our knowledge, the “one-step” correction approach was first employed in high-dimensional 
sparse models by Zhang and Zhang (2014) (ArXiv 2011) which covers the homoscedastic 
linear model (as well as in several follow-up works by the authors). This approach has 
been further used in van de Geer et al. (2014) (ArXiv 2013) which covers homoscedastic 
linear models and some generalized linear models, and Javanmard and Montanari (2014) 
(ArXiv 2013) which offers a related, though somewhat different approach. Note that 
Belloni, Chernozhukov and Kato (2013b) and Belloni, Chernozhukov and Wei (2013) also 
offer results on “one-step” corrections as part of their analysis of estimation and inference 
based upon the orthogonal estimating equations. We would not expect either the 
use of orthogonal estimating equations or the use of “one-step” corrections to dominate 
the other in all cases, though computational evidence in Belloni, Chernozhukov and 
Wei (2013) suggests that the use of exact solutions to orthogonal estimating equations 
may be preferable to approximate solutions obtained from “one-step” corrections in the 
contexts considered in that paper. 

Another branch of the recent literature takes a complementary, but logically distinct, 
approach that aims at doing valid inference for the parameters of a “pseudo-true” model 
that results from the use of a model selection procedure, see Berk et al. (2013). 
Specifically, this approach conditions on a model selected by a data-dependent rule and then 
attempts to do inference - conditional on the selection event - for the parameters of the 
selected model, which may deviate from the “true” model that generated the data. 
Related developments within this approach appear in G’Sell et al. (2013), Lee and Taylor 
(2014), Lee et al. (2013), Lockhart et al. (2014), Loftus and Taylor (2014), Taylor et al. 
(2014), and Fithian et al. (2014). It seems intellectually very interesting to combine 
the developments of the present paper (and other preceding papers cited above) with 
developments in this literature. 

The previously mentioned work focuses on doing inference for low dimensional parameters 
in the presence of high dimensional nuisance parameters. There have also been 
developments on performing inference for high dimensional parameters. Chernozhukov 
(2009) proposed inverting a Lasso performance bound in order to construct a simultaneous, 
Scheffe-style confidence band on all parameters. An interesting feature of this 
approach is that it uses weaker design conditions than many other approaches but requires 
the data analyst to supply explicit bounds on restricted eigenvalues. Gautier 
and Tsybakov (2011) (ArXiv 2011) and Chernozhukov et al. (2013) employ similar ideas 
while also working with various generalizations of restricted eigenvalues. van de Geer and 
Nickl (2013) construct confidence ellipsoids for the entire parameter vector using sample 
splitting ideas. Somewhat related to this literature are the results of Belloni, 
Chernozhukov and Kato (2013b) who use the orthogonal estimating equations framework 
with infinite-dimensional nuisance parameters and construct a simultaneous confidence 
rectangle for many target parameters where the number of target parameters could be 
much larger than the sample size. They relied upon the high-dimensional central limit 
theorems and bootstrap results established in Chernozhukov et al. (2013). 


Most of the aforementioned results rely on (approximate) sparsity and related 
sparsity-based estimators. Some examples of the use of alternative regularization schemes are 
available in the many instrument literature in econometrics. For example, Chamberlain 
and Imbens (2004) use a shrinkage estimator resulting from use of a Gaussian random 
coefficients structure over first-stage coefficients, and Okui (2010) uses ridge regression 
for estimating the first-stage regression in a framework where the instruments may be 
ordered in terms of relevance. Carrasco (2012) employs a different strategy based on 
directly regularizing the inverse that appears in the definition of the 2SLS estimator 
allowing for a number of moment conditions that are larger than the sample size; see 
also Carrasco and Tchuente Nguembu (2012). The theoretical development in Carrasco 
(2012) relies on restrictions on the covariance structure of the instruments rather than on 
the coefficients of the instruments. Hansen and Kozbur (2014) consider a combination 
of ridge-regularization and the jackknife to provide a procedure that is valid allowing for 
the number of instruments to be greater than the sample size under weak restrictions on 
the covariance structure of the instruments and the first-stage coefficients. In all cases, 
the orthogonality condition holds allowing root-n consistent and asymptotically normal 
estimation of the main parameter $\alpha$. 

Many other interesting procedures beyond those mentioned in this review have been 
developed for estimating high-dimensional models; see, e.g., Hastie et al. (2009) for a 
textbook review. Developing new techniques for estimation in high-dimensional settings 
is also still an active area of research, so the list of methods available to researchers 
continues to expand. The use of these procedures and the impact of their use on inference 
about low-dimensional parameters of interest is an interesting research direction 
to explore. It seems likely that many of these procedures will provide sufficiently 
high-quality estimates that they may be used for estimating the high-dimensional nuisance 
parameters $\eta$ in the present setting. 


Appendix A. The Lasso and Post-Lasso Estimators in the Linear Model 

Suppose we have data $\{y_i, x_i\}$ for individuals $i = 1, \ldots, n$ where $x_i$ is a $p$-vector of 
predictor variables and $y_i$ is an outcome of interest. Suppose that we are interested in a 
linear prediction model for $y_i$, $y_i = x_i'\eta + \varepsilon_i$, and define the usual least squares criterion 
function: 

$$\widehat{Q}(\eta) := \frac{1}{n} \sum_{i=1}^{n} (y_i - x_i'\eta)^2.$$

The Lasso estimator is defined as a solution of the following optimization program: 

$$\widehat{\eta}_L \in \arg\min_{\eta \in \mathbb{R}^p} \ \widehat{Q}(\eta) + \frac{\lambda}{n} \sum_{j=1}^{p} |\psi_j \eta_j|, \tag{56}$$

where $\lambda$ is the penalty level and $\{\psi_j\}_{j=1}^{p}$ are covariate specific penalty loadings. The 
covariate specific penalty loadings are used to accommodate data that may be 
non-Gaussian, heteroscedastic, and/or dependent and also help ensure basic equivariance of 
coefficient estimates to rescaling of the covariates. 

The Post-Lasso estimator is defined as the ordinary least squares regression applied to 
the model $\widehat{I}$ selected by Lasso: 

$$\widehat{I} = \mathrm{support}(\widehat{\eta}_L) = \{ j \in \{1, \ldots, p\} : |\widehat{\eta}_{L,j}| > 0 \}.$$

The Post-Lasso estimator $\widehat{\eta}_{PL}$ is then 

$$\widehat{\eta}_{PL} \in \arg\min \{ \widehat{Q}(\eta) : \eta \text{ such that } \eta_j = 0 \text{ for all } j \notin \widehat{I} \}. \tag{57}$$

In words, this estimator is ordinary least squares (OLS) using only the regressors whose 
coefficients were estimated to be non-zero by Lasso. 
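
The two estimators can be illustrated with a small NumPy sketch. It assumes the criterion $\frac{1}{n}\sum_i (y_i - x_i'\eta)^2 + \frac{\lambda}{n}\sum_j \psi_j |\eta_j|$ with positive loadings and solves it by plain coordinate descent; the function names are hypothetical and the solver is a generic textbook choice, not the implementation used in the paper.

```python
import numpy as np


def soft_threshold(v, t):
    """Scalar soft-thresholding operator S(v, t) = sign(v) * max(|v| - t, 0)."""
    return np.sign(v) * max(abs(v) - t, 0.0)


def lasso_cd(X, y, lam, psi, n_iter=500, tol=1e-10):
    """Minimize (1/n)*||y - X eta||^2 + (lam/n)*sum_j psi_j*|eta_j| (sketch)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    p = X.shape[1]
    eta = np.zeros(p)
    resid = y.copy()                      # residual y - X @ eta, kept up to date
    col_ss = (X ** 2).sum(axis=0)         # x_j' x_j for each column
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            if col_ss[j] == 0.0:
                continue
            # x_j' (partial residual that leaves out coordinate j)
            rho = X[:, j] @ resid + col_ss[j] * eta[j]
            new = soft_threshold(rho, lam * psi[j] / 2.0) / col_ss[j]
            change = new - eta[j]
            if change != 0.0:
                resid -= X[:, j] * change
                eta[j] = new
                max_change = max(max_change, abs(change))
        if max_change < tol:
            break
    return eta


def post_lasso(X, y, eta_lasso):
    """OLS on the support selected by Lasso; zeros elsewhere."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    support = np.flatnonzero(np.abs(eta_lasso) > 0)
    eta = np.zeros(X.shape[1])
    if support.size:
        eta[support] = np.linalg.lstsq(X[:, support], y, rcond=None)[0]
    return eta, support
```

On an orthonormal design the Lasso step reduces to soft-thresholding the least squares coefficients, and the Post-Lasso refit restores the unshrunk values on the selected support, which is the shrinkage-bias removal described in the text.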


Lasso and Post-Lasso are motivated by the desire to predict the target function well 
without overfitting. The Lasso estimator is a computationally attractive alternative 
to some other classic approaches, such as model selection based on information crite¬ 
ria, because it minimizes a convex function. Moreover, under suitable conditions, the 
Lasso estimator achieves near-optimal rates in estimating the regression function $x_i'\eta$. 
However, Lasso does suffer from the drawback that the regularization by the $\ell_1$-norm 
employed in (56) naturally shrinks all estimated coefficients towards zero causing a po¬ 
tentially significant shrinkage bias. The Post-Lasso estimator is meant to remove some 
of this shrinkage bias and achieves the same rate of convergence as Lasso under sensible 
conditions. 


Practical implementation of the Lasso requires setting the penalty parameter and loadings. 
Verifying good properties of the Lasso typically relies on having these parameters 
set so that the penalty dominates the score in the sense that 

$$\frac{\lambda}{n} \geq \max_{j \leq p} 2c \left| \frac{1}{n} \sum_{i=1}^{n} \frac{x_{j,i}\varepsilon_i}{\psi_j} \right| \quad \text{or, equivalently,} \quad \frac{\lambda}{\sqrt{n}} \geq \max_{j \leq p} 2c \left| \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{x_{j,i}\varepsilon_i}{\psi_j} \right|$$

for some $c > 1$ with high probability. Heuristically, we would have the term inside the 
absolute values behaving approximately like a standard normal random variable if we set 
$\psi_j = \sqrt{\mathrm{Var}\big( \frac{1}{\sqrt{n}} \sum_{i=1}^{n} x_{j,i}\varepsilon_i \big)}$. We could then get the desired domination by setting $\lambda/\sqrt{n}$ 
large enough to dominate the maximum of $p$ standard normal random variables with 
high probability, for example, by setting $\lambda = 2c\sqrt{n}\,\Phi^{-1}(1 - .1/[2p\log(n)])$ where $\Phi^{-1}$ 
denotes the inverse of the standard normal cumulative distribution function. Verifying 
that this heuristic argument holds with large $p$ and data which may not be i.i.d. Gaussian 
requires further argument, as in, for example, Belloni et al. (2012), which covers 
heteroscedastic non-Gaussian data, and Belloni, Chernozhukov, Hansen and Kozbur 
(2014), which covers panel data with within individual dependence. The choice of the 
penalty parameter $\lambda$ can also be refined as in Belloni et al. (2011). Finally, feasible 
implementation requires that $\psi_j$ be estimated which can be done through the iterative 
procedures suggested in Belloni et al. (2012) or Belloni, Chernozhukov, Hansen and 
Kozbur (2014). 

Footnote to the definition of the selected model $\widehat{I}$: We note that we can also allow the set 
$\widehat{I}$ to contain additional variables not selected by Lasso, but we do not consider that here. 
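
The penalty rule $\lambda = 2c\sqrt{n}\,\Phi^{-1}(1 - .1/[2p\log(n)])$ is simple to compute. The sketch below uses only the Python standard library for the normal quantile; the default `c = 1.1` is an assumption (the text only requires $c > 1$), and `penalty_loadings` is a hypothetical helper giving one natural sample analog of the loadings from a set of preliminary residuals, not the iterative procedure of the cited papers.

```python
import math
from statistics import NormalDist

import numpy as np


def lasso_penalty_level(n, p, c=1.1, gamma=0.1):
    """lambda = 2 c sqrt(n) Phi^{-1}(1 - gamma / (2 p log n)); requires n > 1.

    c = 1.1 is an illustrative choice; gamma = 0.1 matches the '.1' in the text.
    """
    quantile = NormalDist().inv_cdf(1.0 - gamma / (2.0 * p * math.log(n)))
    return 2.0 * c * math.sqrt(n) * quantile


def penalty_loadings(X, resid):
    """psi_j = sqrt(mean_i x_ij^2 e_i^2), given preliminary residuals e_i (assumed)."""
    X = np.asarray(X, dtype=float)
    resid = np.asarray(resid, dtype=float)
    return np.sqrt(np.mean((X * resid[:, None]) ** 2, axis=0))
```

The penalty grows only slowly with the number of candidate regressors, roughly like $\sqrt{\log p}$, which is what allows $p$ to be much larger than $n$.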


Appendix B. Proofs 


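B.1. Proof of Proposition. Consider any sequence $\{P_n\}$ in $\{\mathcal{P}_n\}$. 

Step 1 ($r_n$-rate). Here we show that $\|\widehat{\alpha} - \alpha_0\| \leq r_n$ wp $\to 1$. We have by the 
identifiability condition, in particular the assumption $\mathrm{mineig}(\Gamma_1'\Gamma_1) \geq c$, that 

$$P_n(\|\widehat{\alpha} - \alpha_0\| > r_n) \leq P_n(\|M(\widehat{\alpha}, \eta_0)\| \geq \iota(r_n)), \quad \iota(r_n) := 2^{-1}(\{\sqrt{c}\, r_n\} \wedge c).$$

Hence it suffices to show that wp $\to 1$, $\|M(\widehat{\alpha}, \eta_0)\| < \iota(r_n)$. By the triangle inequality, 

$$\|M(\widehat{\alpha}, \eta_0)\| \leq I_1 + I_2 + I_3,$$
$$I_1 = \|M(\widehat{\alpha}, \eta_0) - M(\widehat{\alpha}, \widehat{\eta})\|, \quad I_2 = \|M(\widehat{\alpha}, \widehat{\eta}) - \widehat{M}(\widehat{\alpha}, \widehat{\eta})\|, \quad I_3 = \|\widehat{M}(\widehat{\alpha}, \widehat{\eta})\|.$$

By assumption (12), wp $\to 1$ 

$$I_1 + I_2 \leq o(1)\{r_n + I_3 + \|M(\widehat{\alpha}, \eta_0)\|\}.$$

Hence, 

$$\|M(\widehat{\alpha}, \eta_0)\|(1 - o(1)) \leq o(1)(r_n + I_3) + I_3.$$

By construction of the estimator, 

$$I_3 \leq o(n^{-1/2}) + \inf_{\alpha \in \mathcal{A}} \|\widehat{M}(\alpha, \widehat{\eta})\| \lesssim_{P_n} n^{-1/2},$$

which follows because 

$$\inf_{\alpha \in \mathcal{A}} \|\widehat{M}(\alpha, \widehat{\eta})\| \leq \|\widehat{M}(\bar{\alpha}, \widehat{\eta})\| \lesssim_{P_n} n^{-1/2}, \tag{58}$$

where $\bar{\alpha}$ is the one-step estimator defined in Step 3, as shown in (59). Hence wp $\to 1$ 

$$\|M(\widehat{\alpha}, \eta_0)\| \leq o(r_n) < \iota(r_n),$$

where to obtain the last inequality we have used the assumption $\mathrm{mineig}(\Gamma_1'\Gamma_1) \geq c$. 

Step 2 ($n^{-1/2}$-rate). Here we show that $\|\widehat{\alpha} - \alpha_0\| \lesssim_{P_n} n^{-1/2}$. By condition (14) and 
the triangle inequality, wp $\to 1$ 

$$\|M(\widehat{\alpha}, \eta_0)\| \geq \|\Gamma_1(\widehat{\alpha} - \alpha_0)\| - o(1)\|\widehat{\alpha} - \alpha_0\| \geq (\sqrt{c} - o(1))\|\widehat{\alpha} - \alpha_0\| \geq (\sqrt{c}/2)\|\widehat{\alpha} - \alpha_0\|.$$

Therefore, it suffices to show that $\|M(\widehat{\alpha}, \eta_0)\| \lesssim_{P_n} n^{-1/2}$. We have that 

$$\|M(\widehat{\alpha}, \eta_0)\| \leq II_1 + II_2 + II_3,$$
$$II_1 = \|M(\widehat{\alpha}, \eta_0) - M(\widehat{\alpha}, \widehat{\eta})\|,$$
$$II_2 = \|\widehat{M}(\widehat{\alpha}, \widehat{\eta}) - M(\widehat{\alpha}, \widehat{\eta}) - \widehat{M}(\alpha_0, \eta_0)\|,$$
$$II_3 = \|\widehat{M}(\widehat{\alpha}, \widehat{\eta})\| + \|\widehat{M}(\alpha_0, \eta_0)\|.$$

Then, by the orthogonality $\partial_{\eta'} M(\alpha_0, \eta_0) = 0$ and condition (14), wp $\to 1$, 

$$II_1 \leq \|M(\widehat{\alpha}, \widehat{\eta}) - M(\widehat{\alpha}, \eta_0) - \partial_{\eta'} M(\widehat{\alpha}, \eta_0)[\widehat{\eta} - \eta_0]\| + \|\partial_{\eta'} M(\widehat{\alpha}, \eta_0)[\widehat{\eta} - \eta_0]\|$$
$$\leq o(1)n^{-1/2} + o(1)\|\widehat{\alpha} - \alpha_0\|$$
$$\leq o(1)n^{-1/2} + o(1)(2/\sqrt{c})\|M(\widehat{\alpha}, \eta_0)\|.$$

Then, by condition (13) and by $I_3 \lesssim_{P_n} n^{-1/2}$, 

$$II_2 \leq o(1)\{n^{-1/2} + \|\widehat{M}(\widehat{\alpha}, \widehat{\eta})\| + \|M(\widehat{\alpha}, \eta_0)\|\} \lesssim_{P_n} o(1)\{n^{-1/2} + n^{-1/2} + \|M(\widehat{\alpha}, \eta_0)\|\}.$$

Since $II_3 \lesssim_{P_n} n^{-1/2}$ by (58) and $\|\widehat{M}(\alpha_0, \eta_0)\| \lesssim_{P_n} n^{-1/2}$, it follows that wp $\to 1$, 

$$(1 - o(1))\|M(\widehat{\alpha}, \eta_0)\| \lesssim_{P_n} n^{-1/2}.$$

Step 3 (Linearization). Define the linearization map $\alpha \mapsto \widehat{L}(\alpha)$ by $\widehat{L}(\alpha) := \widehat{M}(\alpha_0, \eta_0) + \Gamma_1(\alpha - \alpha_0)$. Then 

$$\|\widehat{M}(\widehat{\alpha}, \widehat{\eta}) - \widehat{L}(\widehat{\alpha})\| \leq III_1 + III_2 + III_3,$$
$$III_1 = \|M(\widehat{\alpha}, \widehat{\eta}) - M(\widehat{\alpha}, \eta_0)\|,$$
$$III_2 = \|M(\widehat{\alpha}, \eta_0) - \Gamma_1(\widehat{\alpha} - \alpha_0)\|,$$
$$III_3 = \|\widehat{M}(\widehat{\alpha}, \widehat{\eta}) - M(\widehat{\alpha}, \widehat{\eta}) - \widehat{M}(\alpha_0, \eta_0)\|.$$

Then, using the assumptions (13) and (14), conclude 

$$III_1 \leq \|M(\widehat{\alpha}, \widehat{\eta}) - M(\widehat{\alpha}, \eta_0) - \partial_{\eta'} M(\widehat{\alpha}, \eta_0)[\widehat{\eta} - \eta_0]\| + \|\partial_{\eta'} M(\widehat{\alpha}, \eta_0)[\widehat{\eta} - \eta_0]\| \leq o(1)n^{-1/2} + o(1)\|\widehat{\alpha} - \alpha_0\|,$$
$$III_2 \leq o(1)\|\widehat{\alpha} - \alpha_0\|,$$
$$III_3 \leq o(1)(n^{-1/2} + \|\widehat{M}(\widehat{\alpha}, \widehat{\eta})\| + \|M(\widehat{\alpha}, \eta_0)\|) \leq o(1)(n^{-1/2} + n^{-1/2} + III_2 + \|\Gamma_1(\widehat{\alpha} - \alpha_0)\|).$$

Conclude that wp $\to 1$, since $\|\Gamma_1'\Gamma_1\| \lesssim 1$ by assumption, 

$$\|\widehat{M}(\widehat{\alpha}, \widehat{\eta}) - \widehat{L}(\widehat{\alpha})\| \lesssim_{P_n} o(1)(n^{-1/2} + \|\widehat{\alpha} - \alpha_0\|) = o(n^{-1/2}).$$

Also consider the minimizer of the map $\alpha \mapsto \|\widehat{L}(\alpha)\|$, namely, 

$$\bar{\alpha} = \alpha_0 - (\Gamma_1'\Gamma_1)^{-1}\Gamma_1'\widehat{M}(\alpha_0, \eta_0),$$

which obeys $\|\sqrt{n}(\bar{\alpha} - \alpha_0)\| \lesssim_{P_n} 1$ under the conditions of the proposition. We can 
repeat the argument above to conclude that wp $\to 1$, $\|\widehat{M}(\bar{\alpha}, \widehat{\eta}) - \widehat{L}(\bar{\alpha})\| \lesssim_{P_n} o(n^{-1/2})$. 
This implies, since $\|\widehat{L}(\bar{\alpha})\| \lesssim_{P_n} n^{-1/2}$, 

$$\|\widehat{M}(\bar{\alpha}, \widehat{\eta})\| \lesssim_{P_n} n^{-1/2}. \tag{59}$$

This also implies that $\|\widehat{L}(\widehat{\alpha})\| = \|\widehat{L}(\bar{\alpha})\| + o_{P_n}(n^{-1/2})$, since $\|\widehat{L}(\bar{\alpha})\| \leq \|\widehat{L}(\widehat{\alpha})\|$ and 

$$\|\widehat{L}(\widehat{\alpha})\| - o_{P_n}(n^{-1/2}) \leq \|\widehat{M}(\widehat{\alpha}, \widehat{\eta})\| \leq \|\widehat{M}(\bar{\alpha}, \widehat{\eta})\| + o(n^{-1/2}) \leq \|\widehat{L}(\bar{\alpha})\| + o_{P_n}(n^{-1/2}).$$

The former assertion implies that $\|\widehat{L}(\widehat{\alpha})\|^2 = \|\widehat{L}(\bar{\alpha})\|^2 + o_{P_n}(n^{-1})$, so that 

$$\|\widehat{L}(\widehat{\alpha})\|^2 - \|\widehat{L}(\bar{\alpha})\|^2 = \|\Gamma_1(\widehat{\alpha} - \bar{\alpha})\|^2 = o_{P_n}(n^{-1}),$$

from which we can conclude that $\sqrt{n}\|\widehat{\alpha} - \bar{\alpha}\| \to_{P_n} 0$. 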

Step 4. (Conclusion). Given the conclusion of the previous step, the remaining claims 
are standard and follow from the Continuous Mapping Theorem and Lemma ■ 
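B.2. Proof of Proposition. We have wp $\to 1$ that, for some constants $0 < l < u < \infty$, 
$l\|x\| \leq \|Ax\| \leq u\|x\|$ and $l\|x\| \leq \|\widehat{A}x\| \leq u\|x\|$. Hence 

$$\sup_{\alpha \in \mathcal{A}} \frac{\|\widehat{A}\widehat{M}^o(\alpha, \widehat{\eta}) - AM^o(\alpha, \widehat{\eta})\| + \|AM^o(\alpha, \widehat{\eta}) - AM^o(\alpha, \eta_0)\|}{r_n + \|\widehat{A}\widehat{M}^o(\alpha, \widehat{\eta})\| + \|AM^o(\alpha, \eta_0)\|}$$
$$\leq \sup_{\alpha \in \mathcal{A}} \frac{u}{l} \, \frac{\|\widehat{M}^o(\alpha, \widehat{\eta}) - M^o(\alpha, \widehat{\eta})\| + \|M^o(\alpha, \widehat{\eta}) - M^o(\alpha, \eta_0)\|}{(r_n/l) + \|\widehat{M}^o(\alpha, \widehat{\eta})\| + \|M^o(\alpha, \eta_0)\|} + \sup_{\alpha \in \mathcal{A}} \frac{\|\widehat{A} - A\| \|\widehat{M}^o(\alpha, \widehat{\eta})\|}{r_n + l\|\widehat{M}^o(\alpha, \widehat{\eta})\|}$$
$$\lesssim_{P_n} o(1) + \|\widehat{A} - A\|/l \to_{P_n} 0.$$
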

The proof that the rest of the conditions hold is analogous and is therefore omitted. 


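B.3. Proof of Proposition. Step 1. We define the feasible and infeasible “one-steps” 

$$\check{\alpha} = \tilde{\alpha} - \widehat{F}\widehat{M}(\tilde{\alpha}, \widehat{\eta}), \quad \widehat{F} = (\widehat{\Gamma}_1'\widehat{\Gamma}_1)^{-1}\widehat{\Gamma}_1',$$
$$\bar{\alpha} = \alpha_0 - F\widehat{M}(\alpha_0, \eta_0), \quad F = (\Gamma_1'\Gamma_1)^{-1}\Gamma_1'.$$

We deduce by (20) and (11) that 

$$\|\widehat{F}\| \lesssim_{P_n} 1, \quad \|\widehat{F}\Gamma_1 - I\| \lesssim_{P_n} r_n, \quad \|\widehat{F} - F\| \lesssim_{P_n} r_n.$$

Step 2. By Step 1 and by condition (21), we have that 

$$D = \|\widehat{F}\widehat{M}(\tilde{\alpha}, \widehat{\eta}) - \widehat{F}\widehat{M}(\alpha_0, \eta_0) - \widehat{F}\Gamma_1(\tilde{\alpha} - \alpha_0)\|$$
$$\leq \|\widehat{F}\| \|\widehat{M}(\tilde{\alpha}, \widehat{\eta}) - \widehat{M}(\alpha_0, \eta_0) - \Gamma_1(\tilde{\alpha} - \alpha_0)\|$$
$$\lesssim_{P_n} \|\widehat{M}(\tilde{\alpha}, \widehat{\eta}) - M(\tilde{\alpha}, \widehat{\eta}) - \widehat{M}(\alpha_0, \eta_0)\| + D_1 \lesssim_{P_n} o(n^{-1/2}) + D_1,$$

where $D_1 := \|M(\tilde{\alpha}, \widehat{\eta}) - \Gamma_1(\tilde{\alpha} - \alpha_0)\|$. 

Moreover, $D_1 \leq IV_1 + IV_2 + IV_3$, where wp $\to 1$ by condition (21) and $r_n^2 = o(n^{-1/2})$, 

$$IV_1 := \|M(\tilde{\alpha}, \eta_0) - \Gamma_1(\tilde{\alpha} - \alpha_0)\| \lesssim \|\tilde{\alpha} - \alpha_0\|^2 \lesssim o(n^{-1/2}),$$
$$IV_2 := \|M(\tilde{\alpha}, \widehat{\eta}) - M(\tilde{\alpha}, \eta_0) - \partial_{\eta'} M(\tilde{\alpha}, \eta_0)[\widehat{\eta} - \eta_0]\| \lesssim o(n^{-1/2}),$$
$$IV_3 := \|\partial_{\eta'} M(\tilde{\alpha}, \eta_0)[\widehat{\eta} - \eta_0]\| \lesssim o(n^{-1/2}).$$

Conclude that $\sqrt{n} D \to_{P_n} 0$. 

Step 3. We have by the triangle inequality and Steps 1 and 2 that 

$$\sqrt{n}\|\check{\alpha} - \bar{\alpha}\| \leq \sqrt{n}\|(I - \widehat{F}\Gamma_1)(\tilde{\alpha} - \alpha_0)\| + \sqrt{n}\|(\widehat{F} - F)\widehat{M}(\alpha_0, \eta_0)\| + \sqrt{n} D$$
$$\leq \sqrt{n}\|I - \widehat{F}\Gamma_1\| \|\tilde{\alpha} - \alpha_0\| + \|\widehat{F} - F\| \|\sqrt{n}\widehat{M}(\alpha_0, \eta_0)\| + \sqrt{n} D$$
$$\lesssim_{P_n} \sqrt{n} r_n^2 + o(1) = o(1).$$

Thus, $\sqrt{n}\|\check{\alpha} - \bar{\alpha}\| \to_{P_n} 0$, and $\sqrt{n}\|\check{\alpha} - \widehat{\alpha}\| \to_{P_n} 0$ follows from the triangle inequality 
and the fact that $\sqrt{n}\|\widehat{\alpha} - \bar{\alpha}\| \to_{P_n} 0$. ■ 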
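
The “one-step” update analyzed in this proof is a single Gauss-Newton step from a preliminary estimate. The sketch below is a generic illustration under stated assumptions: `moment` is a hypothetical callable returning the estimated moment vector at a given value of the target parameter (with the nuisance estimate held fixed inside it), and `jacobian` is an estimate of the derivative matrix of the moments with respect to the target parameter.

```python
import numpy as np


def one_step(alpha_tilde, moment, jacobian):
    """One Gauss-Newton step: alpha_tilde - (G'G)^{-1} G' moment(alpha_tilde).

    alpha_tilde : (d,) preliminary estimate
    moment      : callable mapping a (d,) vector to a (k,) moment vector (assumed)
    jacobian    : (k, d) estimated derivative matrix G with full column rank
    """
    alpha_tilde = np.asarray(alpha_tilde, dtype=float)
    G = np.asarray(jacobian, dtype=float)
    F = np.linalg.solve(G.T @ G, G.T)     # F = (G'G)^{-1} G'
    return alpha_tilde - F @ moment(alpha_tilde)
```

When the moment function is exactly linear in the target parameter, one step from any starting value lands on the least squares solution; in the nonlinear case the step only approximates the exact solution, which is the source of the possible higher-order differences.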
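B.4. Proof of Lemma 2. The conditions of Proposition 1 are clearly satisfied, and thus 
the conclusions of Proposition 1 immediately follow. We also have that, for $\widehat{\Gamma}_1 = \widehat{\Gamma}_1(\widehat{\eta})$, 

$$\sqrt{n}(\widehat{\alpha} - \alpha_0) = -\widehat{F}\sqrt{n}\widehat{M}(\alpha_0, \widehat{\eta}), \quad \widehat{F} = (\widehat{\Gamma}_1'\widehat{\Gamma}_1)^{-1}\widehat{\Gamma}_1',$$
$$\sqrt{n}(\bar{\alpha} - \alpha_0) := -F\sqrt{n}\widehat{M}(\alpha_0, \eta_0), \quad F = (\Gamma_1'\Gamma_1)^{-1}\Gamma_1'.$$

We deduce by (33) and (11) that $\|\widehat{F}\| \lesssim_{P_n} 1$ and $\|\widehat{F} - F\| \to_{P_n} 0$. Hence we have by 
triangle and Hölder inequalities and condition (33) that 

$$\sqrt{n}\|\widehat{\alpha} - \bar{\alpha}\| \leq \|\widehat{F}\|\sqrt{n}\|\widehat{M}(\alpha_0, \widehat{\eta}) - \widehat{M}(\alpha_0, \eta_0)\| + \|\widehat{F} - F\|\sqrt{n}\|\widehat{M}(\alpha_0, \eta_0)\| \to_{P_n} 0.$$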

The conclusions regarding the uniform validity of inference using d, of the form stated in 
conclusions of Proposition 2, follow from the conclusions regarding the uniform validity 
of inference using d, which follow from the Continuous Mapping Theorem, Lemmaj^ and 
the assumed stability conditions (11). This establishes the second claim of the Lemma. 
Verification of the conditions of Proposition 2 is omitted. ■ 


B.5. Proof of Lemma 3 and 4. The proof of Lemma 3 is given in the main text. As 
in the proof of Lemma 3, we can expand: 

$$\sqrt{n}(\widehat{M}_j(\alpha_0, \widehat{\eta}) - \widehat{M}_j(\alpha_0, \eta_0)) = T_{1,j} + T_{2,j} + T_{3,j}, \tag{60}$$

where the terms $(T_{l,j})_{l=1}^{3}$ are as defined in the main text. We can further bound $T_{3,j}$ as 
follows: 






Tg- := V^\{fi - r?^)'d^d^,M,(ao)(f/ - C)l> 
r 4 j := ^/n\r]J)'^r,^^'Mj{ao)vo\■ 


Then Tij = 0 by orthogonality, T 2 J —>-p„ 0 as in the proof of Lemma 3. Since 
s^log(pn)^/n —)■ 0, vanishes in probability because, by Holder’s inequality and for 

sufficiently large n, 

T^j < VnfsjWrj - <p„ ^/^slog{pn)/n ^p„ 0. 

Also, if s^log(pn)^/n —>■ 0, vanishes in probability because, by Holder’s inequality 
and (43), 


Tij < \/n||d^(9^/Mj(ao)||pw(,,5)hS 


r\\2 


<i 


nslog(pn)/n —>-p„ 0. 


The conclusion follows from (60). 


B.6. Proof of Lemma 5. For $m = 1, \ldots, k$ and $l = 1, \ldots, d$, we can bound each element 
$\widehat{\Gamma}_{1,ml}(\eta)$ of matrix $\widehat{\Gamma}_1(\eta)$ as follows: 


4 

f'lMiv) -'^l,mliVo)\ sE 

k=l 


Tgml 

T2,ml 

Ti^ml 


\dnTgmi{vo)'{v-'qo)\, 

\{d'qt i^rnliVo) - 9^'^ I,mliVo))'{V - Vo)\, 
\{rj - vj^)%dr,>ti^rnl{v “ C)l> 
\r]^'dr^dr,'ti^rnlVo\- 
Under conditions (44) and (45) we have that wp $\to 1$ 

Ti^mi < \\dn^i,miiVo)\\oo\\'n - mh ^Pn log(pn)/n 0, 

T 2 ,ml < Wdr^ti^rnlivo) “ (l?o) || oo ||I? “ r?o||l <P„ log(pn)/n 0, 

T 3 ,mi < \\dr,dr^'ti^rni\\spie„s)\\v “ ^P„ slog(pn)/n 0, 

TA,ml < \\drjdr,'ti^rnl\\pwiri-^)\\Vof ^P„ slog(pn)/n ^ 0. 

The claim follows from the assumed growth conditions, since d and k are bounded. ■ 


Appendix C. Key Tools 


Let $\Phi$ and $\Phi^{-1}$ denote the distribution and quantile function of $N(0,1)$. Note that in 
particular $\Phi^{-1}(1 - a) \leq \sqrt{2\log(1/a)}$ for all $a \in (0, 1/2)$. 
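
As a quick numerical sanity check of the standard Gaussian quantile bound $\Phi^{-1}(1 - a) \leq \sqrt{2\log(1/a)}$, the following sketch evaluates the gap between the two sides using only the Python standard library. The function name is hypothetical, and the check is illustrative rather than a proof.

```python
import math
from statistics import NormalDist


def quantile_bound_gap(a):
    """Return sqrt(2 log(1/a)) - Phi^{-1}(1 - a); non-negative on (0, 1/2)."""
    return math.sqrt(2.0 * math.log(1.0 / a)) - NormalDist().inv_cdf(1.0 - a)
```

The gap stays positive even for very small tail probabilities, so replacing the quantile by the square-root bound costs little when choosing penalty levels.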


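Lemma 6 (Moderate Deviation Inequality for Maximum of a Vector). Suppose that 

$$S_j := \frac{\sum_{i=1}^{n} U_{ij}}{\sqrt{\sum_{i=1}^{n} U_{ij}^2}},$$

where $U_{ij}$ are independent random variables across $i$ with 
mean zero and finite third-order moments. Then 

$$P\Big( \max_{1 \leq j \leq p} |S_j| > \Phi^{-1}(1 - \gamma/2p) \Big) \leq \gamma\Big(1 + \frac{A}{\ell_n^3}\Big),$$

where $A$ is an absolute constant, provided for $\ell_n > 0$ 

$$0 \leq \Phi^{-1}(1 - \gamma/(2p)) \leq \frac{n^{1/6}}{\ell_n} \min_{1 \leq j \leq p} M_j^2 - 1, \quad M_j := \frac{\big( \frac{1}{n} \sum_{i=1}^{n} E[U_{ij}^2] \big)^{1/2}}{\big( \frac{1}{n} \sum_{i=1}^{n} E[|U_{ij}|^3] \big)^{1/3}}.$$

This result is essentially due to Jing et al. (2003). The proof of this result, given 
in Belloni et al. (2012), follows from a simple combination of union bounds with their 
result. 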


Lemma 7 (Laws of Large Numbers for Large Matrices in Sparse Norms). Let $s_n$, $p_n$, $k_n$
and $\ell_n$ be sequences of positive constants such that $\ell_n \to \infty$ but $\ell_n/\log n \to 0$, and let $c_1$ and
$c_2$ be fixed positive constants. Let $(x_i)_{i=1}^n$ be i.i.d. vectors such that
$\|\mathrm{E}[x_i x_i']\|_{sp(s_n \log n)} \le c_1$, and either one of the following holds: (a) $x_i$ is a sub-Gaussian random vector with
$\sup_{\|u\| \le 1} \|x_i'u\|_{\psi_2,P} \le c_2$, where $\|\cdot\|_{\psi_2,P}$ denotes the $\psi_2$-Orlicz norm of a random variable,
and $s_n(\log n)(\log(p_n \vee n))/n \to 0$; or (b) $\|x_i\|_\infty \le k_n$ a.s. and
$k_n^2 s_n(\log^4 n)\log(p_n \vee n)/n \to 0$. Then there is an $o(1)$ term such that with probability $1 - o(1)$:

$$\|\mathbb{E}_n[x_i x_i'] - \mathrm{E}[x_i x_i']\|_{sp(s_n \ell_n)} \le o(1), \qquad \|\mathbb{E}_n[x_i x_i']\|_{sp(s_n \ell_n)} \le c_1 + o(1).$$


Under (a) the result follows from Theorem 3.2 in Rudelson and Zhou (2011) and under
(b) the result follows from Rudelson and Vershynin (2008), as shown in the Supplemental
Material of Belloni and Chernozhukov (2013).




Lemma 8 (Useful implications of CLT in $\mathbb{R}^m$). Consider a sequence of random vectors
$Z_n$ in $\mathbb{R}^m$ such that $Z_n \rightsquigarrow Z = N(0, I_m)$. The elements of the sequence and the limit
variable need not be defined on the same probability space. Then

$$\lim_{n \to \infty} \sup_{R \in \mathcal{R}} |P(Z_n \in R) - P(Z \in R)| = 0,$$

where $\mathcal{R}$ is the collection of all convex sets in $\mathbb{R}^m$.


Proof. Let $R$ denote a generic convex set in $\mathbb{R}^m$. Let $R^{\epsilon} = \{z \in \mathbb{R}^m : d(z, R) \le \epsilon\}$
and $R^{-\epsilon} = \{z \in R : B(z, \epsilon) \subset R\}$, where $d$ is the Euclidean distance and
$B(z, \epsilon) = \{y \in \mathbb{R}^m : d(y, z) \le \epsilon\}$. The set $R^{-\epsilon}$ may be empty. By Theorem 11.3.3 in Dudley
(2002), $\epsilon_n := \rho(Z_n, Z) \to 0$, where $\rho$ is the Prohorov metric. The definition of the metric
implies that $P(Z_n \in R) \le P(Z \in R^{\epsilon_n}) + \epsilon_n$. By the reverse isoperimetric inequality
[Prop 2.5, Chen and Fang (2011)], $|P(Z \in R^{\epsilon_n}) - P(Z \in R)| \le m^{1/2}\epsilon_n$. Hence
$P(Z_n \in R) \le P(Z \in R) + \epsilon_n(1 + m^{1/2})$. Furthermore, for any convex set $R$, $(R^{-\epsilon_n})^{\epsilon_n} \subset R$
(interpreting the expansion of an empty set as an empty set). Hence for any convex $R$
we have $P(Z \in R^{-\epsilon_n}) \le P(Z_n \in R) + \epsilon_n$ by definition of Prohorov's metric. By the
reverse isoperimetric inequality, $|P(Z \in R^{-\epsilon_n}) - P(Z \in R)| \le m^{1/2}\epsilon_n$. Conclude that
$P(Z_n \in R) \ge P(Z \in R) - \epsilon_n(1 + m^{1/2})$. ■
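
As a one-dimensional illustration of Lemma 8 (an assumption-laden sketch, not part of the proof), take $Z_n$ to be a standardized Binomial(n, 1/2) variable and restrict the convex sets to closed intervals whose endpoints are atoms of $Z_n$. The maximum discrepancy over this restricted class is a lower bound on the supremum in the lemma, and it can be computed exactly. The function name and the chosen values of n are hypothetical.

```python
import math
from statistics import NormalDist


def interval_discrepancy(n: int) -> float:
    """Max over closed intervals with atom endpoints of |P(Z_n in R) - P(Z in R)|,
    where Z_n is a standardized Binomial(n, 1/2) and Z ~ N(0, 1)."""
    std = NormalDist()
    pmf = [math.comb(n, k) / 2**n for k in range(n + 1)]
    scale = math.sqrt(n) / 2.0
    atoms = [(k - n / 2.0) / scale for k in range(n + 1)]
    # cdf_left[k] = P(Z_n < atoms[k]); the final entry equals 1 up to rounding.
    cdf_left = [0.0]
    for mass in pmf:
        cdf_left.append(cdf_left[-1] + mass)
    phi = [std.cdf(z) for z in atoms]
    worst = 0.0
    for a in range(n + 1):
        for b in range(a, n + 1):
            p_n = cdf_left[b + 1] - cdf_left[a]  # P(atoms[a] <= Z_n <= atoms[b])
            p_z = phi[b] - phi[a]                # P(atoms[a] <= Z <= atoms[b])
            worst = max(worst, abs(p_n - p_z))
    return worst


discrepancies = {n: interval_discrepancy(n) for n in (25, 100, 400)}
for n, value in discrepancies.items():
    print(f"n={n}: max interval discrepancy = {value:.4f}")
```

The discrepancy shrinks as n grows, in line with the uniform convergence over convex sets that the lemma asserts.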
Supplemental Appendix for “Valid Post-Selection
and Post-Regularization Inference: An Elementary, 
General Approach” 

In this supplement, we provide proofs for the results from Section 5 of “Valid Post-
Selection and Post-Regularization Inference: An Elementary, General Approach”.
Equation numbers (1)-(61) refer to equations defined in the main text, and equation numbers
(62) and greater are defined in this supplement.


Appendix D. Proof of Proposition 5 

We present the proofs for the case of = 1; the general case follows similarly. 

We proceed to verify the assumptions of Lemmas 4 and 5, from which the desired
result follows by Propositions 1 and 2 and Lemma 2. In what follows, we consider an
arbitrary sequence $\{P_n\}$ in $\{\mathbf{P}_n\}$.

Step 1. (Performance bounds for $\hat\eta$). We noted that Condition AS.1 implies the
decomposition (50). A modification of the proofs of Belloni et al. (2012) yields the
following performance bounds for estimator $\hat\eta$ of $\eta_0$: wp $\to 1$,

$$\|\hat\eta\|_0 \lesssim s, \qquad \|\hat\eta - \eta_0^m\|_2 \lesssim \sqrt{(s/n)\log(pn)}, \qquad \|\hat\eta - \eta_0^m\|_1 \lesssim \sqrt{(s^2/n)\log(pn)}. \qquad (62)$$


Note that the required modification addresses two differences between the development
in the present paper and that in Belloni et al. (2012). First, we impose only that errors
are uncorrelated with control regressors and instruments, whereas mean independence
between errors and controls and instruments is assumed in Belloni et al. (2012). Second,
the third step of Algorithm 1 presented in the main text requires regressing an estimated
response variable on the control regressors. This extension is handled by noting that
the estimation error in the response variable can be treated as additional approximation
error in the proofs given in Belloni et al. (2012). We omit these details for brevity and
as they are straightforward.

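Step 2. (Preparation). It is convenient to lift the nuisance parameter $\eta$ into a higher
dimension and redefine the signs of its components as follows:

$$\eta := (\eta_1', \eta_2', \eta_3', \eta_4', \eta_5')' := [-\theta', -\vartheta', \gamma', \delta', -\vartheta']'.$$

With this re-definition, we have

$$\psi(w_i, \alpha, \eta) = \{(y_i + x_i'\eta_1) - (d_i + x_i'\eta_2)\alpha\}\{x_i'\eta_3 + z_i'\eta_4 + x_i'\eta_5\}.$$

Note also that

$$M(\alpha, \eta) = \Gamma_1(\eta)\alpha + \Gamma_2(\eta), \qquad \hat M(\alpha, \eta) = \hat\Gamma_1(\eta)\alpha + \hat\Gamma_2(\eta),$$

$$\Gamma_1(\eta) = \mathrm{E}[\partial_\alpha \psi(w_i, \alpha, \eta)], \qquad \hat\Gamma_1(\eta) = \mathbb{E}_n[\partial_\alpha \psi(w_i, \alpha, \eta)].$$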
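We compute the following partial derivatives:

$$\begin{aligned}
\partial_\eta \psi(w_i) &:= \partial_\eta \psi(w_i, \alpha_0, \eta_0) = [x_i'\varrho_i,\ -\alpha_0 x_i'\varrho_i,\ x_i'\varepsilon_i,\ z_i'\varepsilon_i,\ x_i'\varepsilon_i]', \\
\partial_\alpha \psi(w_i, \alpha, \eta) &= -\{d_i + x_i'\eta_2\}\{x_i'\eta_3 + z_i'\eta_4 + x_i'\eta_5\}, \\
\partial_\alpha \psi(w_i) &:= \partial_\alpha \psi(w_i, \alpha_0, \eta_0) = -\rho_i^d \varrho_i, \\
\partial_\eta \partial_\alpha \psi(w_i) &:= \partial_\eta \partial_\alpha \psi(w_i, \alpha_0, \eta_0) = [0,\ -x_i'\varrho_i,\ -x_i'\rho_i^d,\ -z_i'\rho_i^d,\ -x_i'\rho_i^d]',
\end{aligned}$$

$$\partial_\eta \partial_{\eta'} \psi(w_i, \alpha, \eta) =
\begin{bmatrix}
0 & 0 & x_i x_i' & x_i z_i' & x_i x_i' \\
0 & 0 & -\alpha x_i x_i' & -\alpha x_i z_i' & -\alpha x_i x_i' \\
x_i x_i' & -\alpha x_i x_i' & 0 & 0 & 0 \\
z_i x_i' & -\alpha z_i x_i' & 0 & 0 & 0 \\
x_i x_i' & -\alpha x_i x_i' & 0 & 0 & 0
\end{bmatrix},$$

$$\partial_\eta \partial_{\eta'} \partial_\alpha \psi(w_i, \alpha, \eta) =
\begin{bmatrix}
0 & 0 & 0 & 0 & 0 \\
0 & 0 & -x_i x_i' & -x_i z_i' & -x_i x_i' \\
0 & -x_i x_i' & 0 & 0 & 0 \\
0 & -z_i x_i' & 0 & 0 & 0 \\
0 & -x_i x_i' & 0 & 0 & 0
\end{bmatrix}.$$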
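
The first-order derivatives of the lifted score can be sanity-checked by finite differences. The sketch below assumes scalar $x_i$ and $z_i$ (so each block of $\eta$ is a scalar) and uses arbitrary numeric values; the function names are hypothetical. Because the score is a product of two affine functions of $\eta$, central differences are exact up to rounding.

```python
def psi(w, alpha, eta):
    """Lifted score for scalar x_i and z_i, with eta = (eta1, ..., eta5)."""
    y, d, x, z = w
    e1, e2, e3, e4, e5 = eta
    return ((y + x * e1) - (d + x * e2) * alpha) * (x * e3 + z * e4 + x * e5)


def grad_eta(w, alpha, eta):
    """Closed-form gradient of psi with respect to eta."""
    y, d, x, z = w
    e1, e2, e3, e4, e5 = eta
    resid = (y + x * e1) - (d + x * e2) * alpha  # first (residual-type) factor
    instr = x * e3 + z * e4 + x * e5             # second (instrument-type) factor
    return [x * instr, -alpha * x * instr, x * resid, z * resid, x * resid]


def d_alpha(w, alpha, eta):
    """Closed-form derivative of psi with respect to alpha."""
    y, d, x, z = w
    e1, e2, e3, e4, e5 = eta
    return -(d + x * e2) * (x * e3 + z * e4 + x * e5)


def central_diff(f, t, h=1e-6):
    """Two-sided finite-difference approximation of f'(t)."""
    return (f(t + h) - f(t - h)) / (2.0 * h)


# Arbitrary evaluation point (illustrative values only).
w = (1.3, -0.7, 0.9, 0.4)  # (y_i, d_i, x_i, z_i)
alpha = 0.6
eta = [0.2, -0.5, 0.8, 1.1, -0.3]


def psi_in_coordinate(j):
    """Return psi as a function of the j-th coordinate of eta alone."""
    return lambda t: psi(w, alpha, eta[:j] + [t] + eta[j + 1:])


numeric = [central_diff(psi_in_coordinate(j), eta[j]) for j in range(5)]
grad_err = max(abs(a - b) for a, b in zip(numeric, grad_eta(w, alpha, eta)))
alpha_err = abs(central_diff(lambda t: psi(w, t, eta), alpha) - d_alpha(w, alpha, eta))
print(f"max gradient error: {grad_err:.2e}, alpha-derivative error: {alpha_err:.2e}")
```

Evaluating `grad_eta` at $(\alpha_0, \eta_0)$ reproduces the pattern $[x_i\varrho_i, -\alpha_0 x_i\varrho_i, x_i\varepsilon_i, z_i\varepsilon_i, x_i\varepsilon_i]$, since the two factors then equal $\varepsilon_i$ and $\varrho_i$.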


Step 3. (Verification of Conditions of Lemma 4). 

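Application of Lemma 6, condition $\|\alpha_0\| \le C$ holding by Condition SM, and Condition
RF, yields that wp $\to 1$,

$$\sqrt{n}\,\|\partial_\eta \hat M(\alpha_0, \eta_0) - \partial_\eta M(\alpha_0, \eta_0)\|_\infty = \|\sqrt{n}(\mathbb{E}_n - \mathrm{E})\partial_\eta \psi(w_i)\|_\infty
\lesssim \max_j \sqrt{\mathbb{E}_n[(\partial_\eta \psi(w_i))_j^2]}\,\sqrt{\log(pn)} \lesssim_{P_n} \sqrt{\log(pn)}.$$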


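Application of the triangle inequality, condition $\|\alpha_0\| \le C$, and Condition RF yields

$$\|\partial_\eta \partial_{\eta'} \hat M(\alpha_0, \eta_0)\|_{sp(\ell_n s)} \le C'\,\|\partial_\eta \partial_{\eta'} \mathbb{E}_n[f_i f_i']\|_{sp(\ell_n s)} \lesssim_{P_n} 1,$$

where $C'$ depends on $C$.

Moreover, application of the triangle inequality and the Markov inequality yields, for
any deterministic $a \ne 0$,

$$\|\partial_\eta \partial_{\eta'} \hat M(\alpha_0, \eta_0)\|_{pw(a)} \le C'\,\|\partial_\eta \partial_{\eta'} \mathbb{E}_n[f_i f_i']\|_{pw(a)} \lesssim_{P_n} 1,$$

where $C'$ depends on $C$.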

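We have by Condition SM and the law of iterated expectations:

$$\Omega = \mathrm{E}[\psi^2(w_i, \alpha_0, \eta_0)] = \mathrm{E}[\varepsilon_i^2 \varrho_i^2] \in \mathrm{E}[\varrho_i^2] \cdot [c, C] \subseteq [c^2, C^2],$$

$$\mathrm{E}[|\psi(w_i, \alpha_0, \eta_0)|^{q/2}] \le \mathrm{E}[|\varepsilon_i \varrho_i|^{q/2}] \le C.$$

Application of Lyapunov's Central Limit Theorem yields

$$\Omega^{-1/2}\sqrt{n}\,\hat M(\alpha_0, \eta_0) \rightsquigarrow N(0, 1).$$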
Next, $\hat\Omega(\alpha_0) = \mathbb{E}_n[\psi^2(w_i, \alpha_0, \hat\eta)]$ is consistent for $\Omega$. The proof of this result follows
similarly to the (slightly more difficult) proof of consistency of $\hat\Omega = \mathbb{E}_n[\psi^2(w_i, \hat\alpha, \hat\eta)]$ for
$\Omega$, which is given below.

All conditions of Lemma 4 are now verified. 

Step 4. (Verification of Conditions of Lemma 5). 

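Application of Lemma 6 and Condition RF yields that with probability $1 - o(1)$,

$$\sqrt{n}\,\|\partial_\eta \hat\Gamma_1(\alpha_0, \eta_0) - \partial_\eta \Gamma_1(\alpha_0, \eta_0)\|_\infty = \|\sqrt{n}(\mathbb{E}_n - \mathrm{E})\partial_\eta \partial_\alpha \psi(w_i)\|_\infty
\lesssim \max_j \sqrt{\mathbb{E}_n[(\partial_\eta \partial_\alpha \psi(w_i))_j^2]}\,\sqrt{\log(pn)} \lesssim_{P_n} \sqrt{\log(pn)}.$$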

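Application of the triangle inequality and Condition RF yields:

$$\|\partial_\eta \partial_{\eta'} \hat\Gamma_1(\alpha_0, \eta_0)\|_{sp(\ell_n s)} \le \|\partial_\eta \partial_{\eta'} \mathbb{E}_n[f_i f_i']\|_{sp(\ell_n s)} \lesssim_{P_n} 1.$$

Moreover, application of the triangle inequality and the Markov inequality yields,
for any deterministic $a \ne 0$,

$$\|\partial_\eta \partial_{\eta'} \hat\Gamma_1(\alpha_0, \eta_0)\|_{pw(a)} \le \|\partial_\eta \partial_{\eta'} \mathbb{E}_n[f_i f_i']\|_{pw(a)} \lesssim_{P_n} 1.$$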

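Next, by Condition SM we have

$$|\Gamma_1| = |\mathrm{E}[\rho_i^d \varrho_i]| = \mathrm{E}[\varrho_i^2] \in [c, C].$$

By Condition RF we have

$$\|\partial_\eta \Gamma_1(\alpha_0, \eta_0)\|_\infty = \|\mathrm{E}[\partial_\eta \partial_\alpha \psi(w_i)]\|_\infty \le \max_j \big(\mathrm{E}[|f_{ij}\varrho_i|] \vee \mathrm{E}[|f_{ij}\rho_i^d|]\big)
\le \max_j \Big(\sqrt{\mathrm{E}[|f_{ij}\varrho_i|^2]} \vee \sqrt{\mathrm{E}[|f_{ij}\rho_i^d|^2]}\Big) \le \sqrt{C}.$$

This, together with the previous steps, verifies the conditions of Lemma 5, which are sufficient to
establish that $|\hat\alpha - \alpha_0| \lesssim_{P_n} n^{-1/2}$, which is needed in the last step below.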

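Next, we show $\hat V_n - V_n \to_{P_n} 0$. Given the stability conditions established above, this
follows from $\hat\Gamma_1(\hat\eta) - \Gamma_1 \to_{P_n} 0$, which follows from Lemma 5, and from $\hat\Omega - \Omega \to_{P_n} 0$.

Recall that $\hat\Omega = \mathbb{E}_n[\psi^2(w_i, \hat\alpha, \hat\eta)]$ and let $\hat\Omega_0 = \mathbb{E}_n[\psi^2(w_i, \alpha_0, \eta_0)]$. Since $\hat\Omega_0 - \Omega \to_{P_n} 0$
by the Markov inequality, it suffices to show that $\hat\Omega - \hat\Omega_0 \to_{P_n} 0$. Since
$\hat\Omega - \hat\Omega_0 = \big(\sqrt{\hat\Omega} - \sqrt{\hat\Omega_0}\big)\big(\sqrt{\hat\Omega} + \sqrt{\hat\Omega_0}\big)$,
it suffices to show that $\sqrt{\hat\Omega} - \sqrt{\hat\Omega_0} \to_{P_n} 0$. By the triangle
inequality and some simple calculations, we have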

I - \/S)| < D := hloo^^niQf] + hloolhlloo + Ihlloo^^nisf], 

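where the terms are defined below. Let

$$\hat\varepsilon_i = \hat\rho_i^y - \hat\rho_i^d \hat\alpha, \qquad \varepsilon_i = \rho_i^y - \rho_i^d \alpha_0,$$

$$\hat\varrho_i = x_i'\hat\gamma + z_i'\hat\delta - x_i'\hat\vartheta, \qquad \varrho_i = x_i'\gamma_0 + z_i'\delta_0 - x_i'\vartheta_0.$$

Then

$$|\hat\varepsilon_i - \varepsilon_i| \le |x_i'(\theta_0 - \hat\theta)| + |\rho_i^d(\alpha_0 - \hat\alpha)| + |x_i'(\hat\vartheta - \vartheta_0)\alpha_0| + |x_i'(\hat\vartheta - \vartheta_0)(\hat\alpha - \alpha_0)|,$$

$$|\hat\varrho_i - \varrho_i| \le |z_i'(\hat\delta - \delta_0)| + |x_i'(\hat\gamma - \gamma_0)| + |x_i'(\hat\vartheta - \vartheta_0)|.$$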
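
The first of these two bounds rests on an exact algebraic decomposition followed by the triangle inequality. A sketch of that step, assuming the residual definitions $\hat\rho_i^y = y_i - x_i'\hat\theta$, $\hat\rho_i^d = d_i - x_i'\hat\vartheta$, $\rho_i^y = y_i - x_i'\theta_0$ and $\rho_i^d = d_i - x_i'\vartheta_0$ (taken here as an assumption about the main-text notation), is:

```latex
\begin{align*}
\hat\varepsilon_i - \varepsilon_i
  &= (y_i - x_i'\hat\theta) - (d_i - x_i'\hat\vartheta)\hat\alpha
     - (y_i - x_i'\theta_0) + (d_i - x_i'\vartheta_0)\alpha_0 \\
  &= x_i'(\theta_0 - \hat\theta) + d_i(\alpha_0 - \hat\alpha)
     + x_i'\hat\vartheta\,\hat\alpha - x_i'\vartheta_0\,\alpha_0 \\
  &= x_i'(\theta_0 - \hat\theta) + (d_i - x_i'\vartheta_0)(\alpha_0 - \hat\alpha)
     + x_i'(\hat\vartheta - \vartheta_0)\hat\alpha \\
  &= x_i'(\theta_0 - \hat\theta) + \rho_i^d(\alpha_0 - \hat\alpha)
     + x_i'(\hat\vartheta - \vartheta_0)\alpha_0
     + x_i'(\hat\vartheta - \vartheta_0)(\hat\alpha - \alpha_0).
\end{align*}
```

Taking absolute values term by term gives the four-term bound; the bound on $|\hat\varrho_i - \varrho_i|$ follows in the same way from the difference of the two linear forms.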

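Then the terms $I_2$, $I_\infty$, $II_2$, $II_\infty$ are defined and bounded, using elementary inequalities
and Condition RF, as follows:

$$\begin{aligned}
I_2 &:= \sqrt{\mathbb{E}_n[(\hat\varepsilon_i - \varepsilon_i)^2]} \lesssim_{P_n} \sqrt{s\log(pn)/n} + \sqrt{\mathbb{E}_n[(\rho_i^d)^2]}\,|\hat\alpha - \alpha_0| \\
&\qquad + \sqrt{s\log(pn)/n}\,|\alpha_0| + \sqrt{s\log(pn)/n}\,|\hat\alpha - \alpha_0| \to_{P_n} 0, \\
I_\infty &:= \max_{i \le n}|\hat\varepsilon_i - \varepsilon_i| \lesssim_{P_n} \max_{i,j}|f_{ij}|\sqrt{s^2\log(pn)/n} + \max_{i \le n}|\rho_i^d|\,|\hat\alpha - \alpha_0| \\
&\qquad + \max_{i,j}|f_{ij}|\sqrt{s^2\log(pn)/n}\,|\alpha_0| + \max_{i,j}|f_{ij}|\sqrt{s^2\log(pn)/n}\,|\hat\alpha - \alpha_0| \to_{P_n} 0, \\
II_2 &:= \sqrt{\mathbb{E}_n[(\hat\varrho_i - \varrho_i)^2]} \lesssim_{P_n} \sqrt{s\log(pn)/n} \to_{P_n} 0, \\
II_\infty &:= \max_{i \le n}|\hat\varrho_i - \varrho_i| \lesssim_{P_n} \max_{i,j}|f_{ij}|\sqrt{s^2\log(pn)/n} \to_{P_n} 0,
\end{aligned}$$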

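where we have used the relations $|\hat\alpha - \alpha_0| \lesssim_{P_n} n^{-1/2}$, $\mathbb{E}_n[|\rho_i^d|^2] \lesssim_{P_n} 1$, and $\max_{i \le n}|\rho_i^d| \lesssim_{P_n} n^{1/q}$
for $q > 4$, holding by Condition SM, and we have used the fact that

$$\mathbb{E}_n\big[(x_i'\{\hat\theta - \theta_0\})^2 + (x_i'\{\hat\vartheta - \vartheta_0\})^2 + (x_i'\{\hat\gamma - \gamma_0\})^2 + (z_i'\{\hat\delta - \delta_0\})^2\big] \lesssim_{P_n} s\log(pn)/n.$$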

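The latter follows from the following argument, for example: wp $\to 1$,

$$\begin{aligned}
\mathbb{E}_n[(x_i'\{\hat\theta - \theta_0\})^2] &\le 2\,\mathbb{E}_n[(x_i'\{\hat\theta - \theta_0^m\})^2] + 2\,\mathbb{E}_n[(x_i'\theta_0^r)^2] \\
&\le 2\,\|\mathbb{E}_n[f_i f_i']\|_{sp(\ell_n s)}\,\|\hat\theta - \theta_0^m\|^2 + 2\,\|\mathbb{E}_n[f_i f_i']\|_{pw(\theta_0^r)}\,\|\theta_0^r\|^2 \\
&\lesssim_{P_n} s\log(pn)/n,
\end{aligned}$$

since wp $\to 1$ $\|\hat\theta - \theta_0^m\|_0 \le \ell_n s$, $\|\hat\theta - \theta_0^m\|^2 \lesssim s\log(pn)/n$, and $\|\theta_0^r\|^2 \lesssim s/n$, by Step 1
and Condition AS.1 (see decomposition (50)), and $\|\mathbb{E}_n[f_i f_i']\|_{sp(\ell_n s)} \lesssim_{P_n} 1$ holding by
Condition RF and $\|\mathbb{E}_n[f_i f_i']\|_{pw(\theta_0^r)} \lesssim_{P_n} 1$ holding by the Markov inequality and Condition
RF.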

Since +Kn[ef] <p„ 1 by Condition SM, we conclude that D —)'p„ 0. ■ 
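
The square-root argument in this last step can be illustrated numerically. The sketch below uses simulated values (all data and variable names are hypothetical) and one valid bound obtained from the triangle inequality, namely $|\sqrt{\hat\Omega} - \sqrt{\hat\Omega_0}| \le I_\infty\sqrt{\mathbb{E}_n[\varrho_i^2]} + I_\infty II_\infty + II_\infty\sqrt{\mathbb{E}_n[\varepsilon_i^2]}$. This particular form is an assumption of the sketch and need not coincide term by term with $D$ as defined in the text.

```python
import math
import random


def rms(values):
    """Empirical L2 norm: sqrt of the sample mean of squares."""
    return math.sqrt(sum(v * v for v in values) / len(values))


rng = random.Random(1)
n = 500
# Simulated "true" quantities and perturbed "estimated" ones (illustrative only).
eps = [rng.gauss(0.0, 1.0) for _ in range(n)]
rho = [rng.gauss(0.0, 1.0) for _ in range(n)]
eps_hat = [e + 0.05 * rng.gauss(0.0, 1.0) for e in eps]
rho_hat = [r + 0.05 * rng.gauss(0.0, 1.0) for r in rho]

# sqrt(Omega_hat) and sqrt(Omega_hat_0) are empirical L2 norms of the score.
root_omega_hat = rms([e * r for e, r in zip(eps_hat, rho_hat)])
root_omega_0 = rms([e * r for e, r in zip(eps, rho)])

i_inf = max(abs(a - b) for a, b in zip(eps_hat, eps))
ii_inf = max(abs(a - b) for a, b in zip(rho_hat, rho))

lhs = abs(root_omega_hat - root_omega_0)
bound = i_inf * rms(rho) + i_inf * ii_inf + ii_inf * rms(eps)

# Difference of squares: Omega_hat - Omega_hat_0 factors through the square roots.
diff = root_omega_hat**2 - root_omega_0**2
factored = (root_omega_hat - root_omega_0) * (root_omega_hat + root_omega_0)

print(f"|sqrt difference| = {lhs:.4f}, bound = {bound:.4f}")
```

The bound follows from writing $\hat\varepsilon_i\hat\varrho_i - \varepsilon_i\varrho_i = (\hat\varepsilon_i - \varepsilon_i)\varrho_i + (\hat\varepsilon_i - \varepsilon_i)(\hat\varrho_i - \varrho_i) + \varepsilon_i(\hat\varrho_i - \varrho_i)$ and bounding each empirical $L_2$ norm by a maximum times an $L_2$ norm, so it holds for any data, not only the simulated draws.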